
[do not merge] pulling out masking and adding test #69


Closed · wants to merge 2 commits

Conversation

@pstjohn (Collaborator) commented Aug 2, 2024

I think there may be a bug in TOT masking for geneformer

@pstjohn requested a review from jstjohn, August 2, 2024 02:52
assert pytest.approx((masked_token_ids == 2).mean(), abs=0.01) == 0.5 * 0.25

# Check that the distribution of random tokens is correct.
assert pytest.approx(((masked_token_ids == 5) | (masked_token_ids == 6)).mean(), abs=0.01) == 0.5 * 0.12
@pstjohn (Collaborator, Author) commented Aug 2, 2024

This test fails for me, but the original model test still passes. This probably explains why I can't get the same loss in #55.

        # Check that the distribution of random tokens is correct.
>       assert pytest.approx(((masked_token_ids == 5) | (masked_token_ids == 6)).mean(), abs=0.01) == 0.5 * 0.12
E       assert 0.04529 ± 1.0e-02 == 0.06
E         
E         comparison failed
E         Obtained: 0.06
E         Expected: 0.04529 ± 1.0e-02

I suspect there might be something wrong with the logic on these lines?
https://github.com/NVIDIA/bionemo-fw-ea/blob/21aa22f4437a460224dfec0ed308a910539857df/sub-packages/bionemo-geneformer/src/bionemo/geneformer/data/singlecell/dataset.py#L313-L315

@jstjohn (Collaborator) commented Aug 2, 2024

Yup, definitely not right. So per our Slack convo, just go ahead and fix this stuff, but for the Geneformer test the loss will be worse after the change with the current 80/10/10 settings. Before you change the target, verify that the rate the model was unintentionally trained with, 80/2/18, produces roughly the same result as before.

Collaborator:

Let's make sure we can set these rates, so that while we're doing replication tests we can use the 80/2/18 mix. At some point we should do a head-to-head comparison of 80/10/10 vs. 80/2/18; until then, we know we can get good models out of 80/2/18 that beat the published Geneformer on at least some tasks, so that may be a good default for Geneformer for now until we nail down the answer to this head-to-head mixing question.

@pstjohn (Collaborator, Author) commented Aug 2, 2024

More concretely:

import numpy as np
import pytest


def test_binomial_logic():
    """Reproduces the dataset's masking logic to check the resulting token distributions."""
    token_ids = np.ones(100_000, dtype=np.int64)
    mask_prob = 0.5
    mask_token_prob = 0.25
    random_token_prob = 0.12
    pad_id = 0

    # Select positions to mask; pad positions are never masked.
    probs = np.full(token_ids.shape[0], mask_prob)
    probs[token_ids == pad_id] = 0.0
    mask = np.random.binomial(1, probs).astype(bool)

    # Two independent coin flips over the masked positions -- this is the suspect logic.
    mask_tokens_positions = mask & np.random.binomial(1, mask_token_prob, mask.shape).astype(bool)
    random_tokens_positions = (mask & np.random.binomial(1, random_token_prob, mask.shape).astype(bool)) & (
        ~mask_tokens_positions
    )

    assert pytest.approx(mask.mean(), abs=0.01) == mask_prob
    assert pytest.approx(mask_tokens_positions.mean(), abs=0.01) == mask_prob * mask_token_prob

    # fails, 0.06 != 0.0437
    assert pytest.approx(random_tokens_positions.mean(), abs=0.01) == mask_prob * random_token_prob

It's even more obvious with

mask_prob = 1.0
mask_token_prob = 0.50
random_token_prob = 0.50

where 50% of the tokens end up with the mask token, but only 25% end up as random tokens. You're doing two independent coin flips, which together only cover 75% of the tokens, rather than the expected 50/50 split between random and mask tokens.

I.e., with the above code, the chance of a token ending up as a random token is

random_token_prob * mask_prob * (1 - mask_token_prob) = 0.12 * 0.5 * 0.75 = 0.045
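
A minimal sketch of one way the split can be done so the requested fractions hold (illustrative only, not necessarily the fix adopted in this PR; `mask_id`, `pad_id`, and `vocab_size` are hypothetical values chosen to match the test above):

import numpy as np

def apply_masking_sketch(token_ids, mask_prob=0.5, mask_token_prob=0.25, random_token_prob=0.12,
                         mask_id=2, pad_id=0, vocab_size=10):
    # Decide which positions are masked at all; never mask padding.
    probs = np.full(token_ids.shape[0], mask_prob)
    probs[token_ids == pad_id] = 0.0
    mask = np.random.binomial(1, probs).astype(bool)

    # A single uniform draw per position partitions the masked positions into
    # [MASK] / random / unchanged, so the three fractions sum to 1 as requested.
    u = np.random.uniform(size=token_ids.shape[0])
    mask_token_positions = mask & (u < mask_token_prob)
    random_token_positions = mask & (u >= mask_token_prob) & (u < mask_token_prob + random_token_prob)

    masked_token_ids = token_ids.copy()
    masked_token_ids[mask_token_positions] = mask_id
    masked_token_ids[random_token_positions] = np.random.randint(
        low=0, high=vocab_size, size=int(random_token_positions.sum())
    )
    return masked_token_ids, mask

With this partitioning, for non-pad inputs random_token_positions.mean() converges to mask_prob * random_token_prob (0.5 * 0.12 = 0.06), which is what the failing assertion expects.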

@pstjohn changed the title from "pulling out masking and adding test" to "[do not merge] pulling out masking and adding test" on Aug 2, 2024
@jstjohn (Collaborator) commented Aug 2, 2024

Yeah, this is what I was saying: to replicate the tokenizer's effective rate after the fix, you would need to set the random token prob to 0.02 rather than 0.1. I believe the effective rate with the current version is something like:

random_rate = random_token_prob * (1 - mask_token_prob)

So in the case of requesting 0.8, 0.15, 0.1 you end up getting the right masking rate, but the random rate is 0.1 * 0.2 = 0.02 of the masked tokens. So after your fix, to get the old behavior I think the settings would be: 0.8, 0.15, 0.02.

Is that right? This is what we put in the documentation. It's a known bug, and I was wanting to just switch to a well-tested, standardized masking function for the fix.

Training with this mix isn't necessarily wrong, just not what was intended.
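
As a quick check of the arithmetic above (a sketch only; the requested triple is read here as mask_token_prob, mask_prob, random_token_prob):

# Effective behavior of the two-independent-flips logic for the requested 0.8 / 0.15 / 0.1.
mask_token_prob, mask_prob, random_token_prob = 0.8, 0.15, 0.1

# mask_prob only controls how many positions get masked overall;
# the split below is among the masked positions.
frac_mask_token = mask_token_prob                          # 0.80 -> [MASK]
frac_random = random_token_prob * (1 - mask_token_prob)    # 0.02 -> random token
frac_unchanged = 1 - frac_mask_token - frac_random         # 0.18 -> left unchanged

print(frac_mask_token, frac_random, frac_unchanged)  # 0.8 0.02 0.18, i.e. the "80/2/18" mix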

@pstjohn self-assigned this Aug 2, 2024
@pstjohn (Collaborator, Author) commented Aug 2, 2024

Yup, that fixes it; sorry, I could have saved myself some debugging time if I had read those comments more closely. Thanks!

@pstjohn closed this Aug 2, 2024
@pstjohn deleted the pstjohn/v2-main/masking-issue branch September 16, 2024 12:55