[do not merge] pulling out masking and adding test #69
Conversation
```python
assert pytest.approx((masked_token_ids == 2).mean(), abs=0.01) == 0.5 * 0.25

# Check that the distribution of random tokens is correct.
assert pytest.approx(((masked_token_ids == 5) | (masked_token_ids == 6)).mean(), abs=0.01) == 0.5 * 0.12
```
This test fails for me, but the original model test still passes. That probably explains why I can't get the same loss in #55:
```
# Check that the distribution of random tokens is correct.
>   assert pytest.approx(((masked_token_ids == 5) | (masked_token_ids == 6)).mean(), abs=0.01) == 0.5 * 0.12
E   assert 0.04529 ± 1.0e-02 == 0.06
E
E     comparison failed
E     Obtained: 0.06
E     Expected: 0.04529 ± 1.0e-02
```
I suspect there might be something wrong with the logic on these lines:
https://github.com/NVIDIA/bionemo-fw-ea/blob/21aa22f4437a460224dfec0ed308a910539857df/sub-packages/bionemo-geneformer/src/bionemo/geneformer/data/singlecell/dataset.py#L313-L315
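For what it's worth, the failing value is exactly what you'd get if the two coin flips are independent rather than mutually exclusive. A quick sanity check of the arithmetic (a sketch using the probabilities from the test above, not the repo's code):

```python
mask_prob = 0.5
mask_token_prob = 0.25
random_token_prob = 0.12

# If the random-token flip is independent of the mask-token flip, a token only
# counts as "random" when the random flip succeeds AND the mask-token flip fails:
buggy_random_rate = mask_prob * random_token_prob * (1 - mask_token_prob)
print(buggy_random_rate)  # 0.045 -- consistent with the observed 0.04529

# What the test expects (mutually exclusive split):
print(mask_prob * random_token_prob)  # 0.06
```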
Yup, definitely not right. So per our Slack convo, just go ahead and fix this stuff. For the geneformer test, though, the loss will be worse after the change with the current 80/10/10 settings. Before you change the target, verify that the rate the model was unintentionally trained with, 80/2/18, produces the same-ish result as before.
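To spell out where 80/2/18 comes from under the buggy independent-flip logic with the intended 80/10/10 settings (per masked token):

```python
mask_token_rate = 0.8
random_token_rate = 0.1

mask_fraction = mask_token_rate                              # 0.80 -> [MASK]
random_fraction = random_token_rate * (1 - mask_token_rate)  # 0.02 -> random token
unchanged_fraction = 1 - mask_fraction - random_fraction     # 0.18 -> left as-is
print(mask_fraction, random_fraction, unchanged_fraction)    # 0.8 0.02 0.18 (up to float rounding)
```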
Let's make sure we can set these rates so that we can use the 80/2/18 mix while doing replication tests. At some point we should do a head-to-head on 80/10/10 vs 80/2/18. Until then, we know we can get good models out of 80/2/18 that are better than the published Geneformer on at least some tasks, so that may be a good default for Geneformer for now, until we nail down the answer to this head-to-head mixing question.
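A minimal sketch of what a configurable, mutually exclusive masking helper could look like (the function name, signature, and token ids are hypothetical, not the existing `dataset.py` API):

```python
import numpy as np


def apply_masking(
    token_ids: np.ndarray,
    mask_prob: float = 0.15,
    mask_token_prob: float = 0.8,
    random_token_prob: float = 0.1,
    pad_id: int = 0,
    mask_id: int = 2,
    vocab_low: int = 5,
    vocab_high: int = 7,
    rng: np.random.Generator | None = None,
) -> np.ndarray:
    """Return a copy of token_ids with BERT-style masking applied.

    A single uniform draw per masked position is partitioned into bands, so the
    [MASK] / random / unchanged split is exactly mask_token_prob /
    random_token_prob / the remainder, instead of two independent coin flips.
    """
    rng = rng or np.random.default_rng()
    out = token_ids.copy()

    # Select positions to perturb, never touching padding.
    probs = np.full(token_ids.shape[0], mask_prob)
    probs[token_ids == pad_id] = 0.0
    mask = rng.binomial(1, probs).astype(bool)

    # One uniform draw, partitioned into mutually exclusive bands.
    u = rng.uniform(size=token_ids.shape[0])
    mask_positions = mask & (u < mask_token_prob)
    random_positions = mask & (u >= mask_token_prob) & (u < mask_token_prob + random_token_prob)

    out[mask_positions] = mask_id
    out[random_positions] = rng.integers(vocab_low, vocab_high, size=int(random_positions.sum()))
    return out
```

With something like this, the replication run is just `mask_token_prob=0.8, random_token_prob=0.02`, and the 80/10/10 vs 80/2/18 head-to-head becomes a config change rather than a code change.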
More concretely:

```python
import numpy as np
import pytest


def test_binomial_logic():
    token_ids = np.ones(100_000, dtype=np.int64)
    mask_prob = 0.5
    mask_token_prob = 0.25
    random_token_prob = 0.12
    pad_id = 0

    probs = np.full(token_ids.shape[0], mask_prob)
    probs[token_ids == pad_id] = 0.0
    mask = np.random.binomial(1, probs).astype(bool)
    mask_tokens_positions = mask & np.random.binomial(1, mask_token_prob, mask.shape).astype(bool)
    random_tokens_positions = (mask & np.random.binomial(1, random_token_prob, mask.shape).astype(bool)) & (
        ~mask_tokens_positions
    )

    assert pytest.approx(mask.mean(), abs=0.01) == mask_prob
    assert pytest.approx(mask_tokens_positions.mean(), abs=0.01) == mask_prob * mask_token_prob
    # fails: 0.06 != 0.0437
    assert pytest.approx(random_tokens_positions.mean(), abs=0.01) == mask_prob * random_token_prob
```

It's even more obvious with mask_prob = 1.0, mask_token_prob = 0.50, random_token_prob = 0.50, where 50% of the tokens end up with the mask token, but only 25% end up with the random mask. You're doing two independent coin flips which only cover 75% of the tokens, rather than the expected 50/50 split between random and mask tokens. I.e., the chance of having a random token with the above code is random_token_prob * (1 - mask_token_prob) per masked token, not random_token_prob.
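Running that extreme case makes the 75% coverage concrete (a quick sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
mask = np.ones(n, dtype=bool)  # mask_prob = 1.0: every position is selected

# Two independent coin flips, as in the buggy logic:
mask_tokens = mask & rng.binomial(1, 0.5, n).astype(bool)
random_tokens = mask & rng.binomial(1, 0.5, n).astype(bool) & ~mask_tokens

print(mask_tokens.mean())    # ~0.50: half become the mask token, as requested
print(random_tokens.mean())  # ~0.25, not 0.50: only 75% of tokens are covered
```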
Yeah, this is what I was saying: to replicate the tokenizer rate after fixing, you would need to set the random token prob to 0.02 rather than 0.1. I believe the expected rate with this version is something like random_rate = random_token_rate * (1 - mask_token_rate). So in the case of requesting 0.8, 0.15, 0.1 you end up getting the right masking rate, but the random rate is 0.1 * 0.2 = 0.02 of masked tokens. So yeah, after your fix, to get the old behavior I think the settings would be 0.8, 0.15, 0.02. Is that right? This is what we put in the documentation. It's a known bug, and I was wanting to just switch to a well-tested, standardized masking function for the fix. Training with this mix isn't necessarily wrong, just not what was intended.
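A quick Monte-Carlo sanity check of that equivalence (a sketch comparing the buggy mix at 0.8/0.15/0.1 against a mutually exclusive split at 0.8/0.15/0.02):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
mask = rng.binomial(1, 0.15, n).astype(bool)  # mask_rate = 0.15

# Buggy independent flips with the old settings (0.8 mask-token, 0.1 random):
buggy_mask_tok = mask & rng.binomial(1, 0.8, n).astype(bool)
buggy_random = mask & rng.binomial(1, 0.1, n).astype(bool) & ~buggy_mask_tok

# Fixed mutually exclusive split with the compensated settings (0.8, 0.02):
u = rng.uniform(size=n)
fixed_mask_tok = mask & (u < 0.8)
fixed_random = mask & (u >= 0.8) & (u < 0.82)

print(buggy_random.mean(), fixed_random.mean())  # both ~0.15 * 0.02 = 0.003
```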
Yup, that fixes it; sorry, I could have saved myself some debugging time if I had read those comments more closely. Thanks!
I think there may be a bug in TOT masking for geneformer