
[do not merge] pulling out masking and adding test #69


Closed · wants to merge 2 commits

Conversation

@pstjohn (Collaborator) commented Aug 2, 2024

I think there may be a bug in TOT masking for geneformer

@pstjohn requested a review from jstjohn, August 2, 2024 02:52
assert pytest.approx((masked_token_ids == 2).mean(), abs=0.01) == 0.5 * 0.25

# Check that the distribution of random tokens is correct.
assert pytest.approx(((masked_token_ids == 5) | (masked_token_ids == 6)).mean(), abs=0.01) == 0.5 * 0.12
@pstjohn (Collaborator, Author) commented Aug 2, 2024

This test fails for me, but the original model test still passes. This probably explains why I can't get the same loss in #55.

        # Check that the distribution of random tokens is correct.
>       assert pytest.approx(((masked_token_ids == 5) | (masked_token_ids == 6)).mean(), abs=0.01) == 0.5 * 0.12
E       assert 0.04529 ± 1.0e-02 == 0.06
E         
E         comparison failed
E         Obtained: 0.06
E         Expected: 0.04529 ± 1.0e-02

I suspect there might be something wrong with the logic on these lines?
https://github.com/NVIDIA/bionemo-fw-ea/blob/21aa22f4437a460224dfec0ed308a910539857df/sub-packages/bionemo-geneformer/src/bionemo/geneformer/data/singlecell/dataset.py#L313-L315

@jstjohn (Collaborator) commented Aug 2, 2024

Yup, definitely not right. So per our Slack convo, just go ahead and fix this stuff, but for the Geneformer test the loss will be worse after the change with the current 80/10/10 settings. Before you change the target, verify that the rate the model was unintentionally trained with, 80/2/18, produces roughly the same result as before.

Collaborator:

Let's make sure we can set these rates, so that while we're doing replication tests we can use the 80/2/18 mix. At some point we should do a head-to-head comparison of 80/10/10 vs. 80/2/18; until then, we know we can get good models out of 80/2/18 that beat the published Geneformer on at least some tasks, so that may be a good default for Geneformer for now until we nail down the answer to this head-to-head mixing question.

@pstjohn (Collaborator, Author) commented Aug 2, 2024

More concretely:

import numpy as np
import pytest


def test_binomial_logic():
    """Reproduces the dataset's masking logic to check the resulting token distributions."""
    token_ids = np.ones(100_000, dtype=np.int64)
    mask_prob = 0.5
    mask_token_prob = 0.25
    random_token_prob = 0.12
    pad_id = 0

    # Select positions to mask; pad positions are never masked.
    probs = np.full(token_ids.shape[0], mask_prob)
    probs[token_ids == pad_id] = 0.0
    mask = np.random.binomial(1, probs).astype(bool)

    # Two independent coin flips over the masked positions -- this is the suspect logic.
    mask_tokens_positions = mask & np.random.binomial(1, mask_token_prob, mask.shape).astype(bool)
    random_tokens_positions = (mask & np.random.binomial(1, random_token_prob, mask.shape).astype(bool)) & (
        ~mask_tokens_positions
    )

    assert pytest.approx(mask.mean(), abs=0.01) == mask_prob
    assert pytest.approx(mask_tokens_positions.mean(), abs=0.01) == mask_prob * mask_token_prob

    # fails, 0.06 != 0.0437
    assert pytest.approx(random_tokens_positions.mean(), abs=0.01) == mask_prob * random_token_prob

It's even more obvious with

mask_prob = 1.0
mask_token_prob = 0.50
random_token_prob = 0.50

where 50% of the tokens end up with the mask token, but only 25% end up as random tokens. You're doing two independent coin flips, which together only cover 75% of the tokens, rather than the expected 50/50 split between random and mask tokens.

I.e., with the above code, the chance of a token ending up as a random token is

random_token_prob * mask_prob * (1 - mask_token_prob) = 0.12 * 0.5 * 0.75 = 0.045
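
A minimal sketch of one way the split can be done so the requested fractions hold (illustrative only, not necessarily the fix adopted in this PR; `mask_id`, `pad_id`, and `vocab_size` are hypothetical values chosen to match the test above):

import numpy as np

def apply_masking_sketch(token_ids, mask_prob=0.5, mask_token_prob=0.25, random_token_prob=0.12,
                         mask_id=2, pad_id=0, vocab_size=10):
    # Decide which positions are masked at all; never mask padding.
    probs = np.full(token_ids.shape[0], mask_prob)
    probs[token_ids == pad_id] = 0.0
    mask = np.random.binomial(1, probs).astype(bool)

    # A single uniform draw per position partitions the masked positions into
    # [MASK] / random / unchanged, so the three fractions sum to 1 as requested.
    u = np.random.uniform(size=token_ids.shape[0])
    mask_token_positions = mask & (u < mask_token_prob)
    random_token_positions = mask & (u >= mask_token_prob) & (u < mask_token_prob + random_token_prob)

    masked_token_ids = token_ids.copy()
    masked_token_ids[mask_token_positions] = mask_id
    masked_token_ids[random_token_positions] = np.random.randint(
        low=0, high=vocab_size, size=int(random_token_positions.sum())
    )
    return masked_token_ids, mask

With this partitioning, for non-pad inputs random_token_positions.mean() converges to mask_prob * random_token_prob (0.5 * 0.12 = 0.06), which is what the failing assertion expects.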

@pstjohn changed the title from "pulling out masking and adding test" to "[do not merge] pulling out masking and adding test" on Aug 2, 2024
@jstjohn (Collaborator) commented Aug 2, 2024

Yeah, this is what I was saying: to replicate the tokenizer's effective rate after the fix, you would need to set the random token prob to 0.02 rather than 0.1. I believe the effective rate with the current version is something like:

random_rate = random_token_prob * (1 - mask_token_prob)

So in the case of requesting 0.8, 0.15, 0.1 you end up getting the right masking rate, but the random rate is 0.1 * 0.2 = 0.02 of the masked tokens. So after your fix, to get the old behavior I think the settings would be: 0.8, 0.15, 0.02.

Is that right? This is what we put in the documentation. It's a known bug, and I was wanting to just switch to a well-tested, standardized masking function for the fix.

Training with this mix isn't necessarily wrong, just not what was intended.
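
As a quick check of the arithmetic above (a sketch only; the requested triple is read here as mask_token_prob, mask_prob, random_token_prob):

# Effective behavior of the two-independent-flips logic for the requested 0.8 / 0.15 / 0.1.
mask_token_prob, mask_prob, random_token_prob = 0.8, 0.15, 0.1

# mask_prob only controls how many positions get masked overall;
# the split below is among the masked positions.
frac_mask_token = mask_token_prob                          # 0.80 -> [MASK]
frac_random = random_token_prob * (1 - mask_token_prob)    # 0.02 -> random token
frac_unchanged = 1 - frac_mask_token - frac_random         # 0.18 -> left unchanged

print(frac_mask_token, frac_random, frac_unchanged)  # 0.8 0.02 0.18, i.e. the "80/2/18" mix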

@pstjohn self-assigned this Aug 2, 2024
@pstjohn (Collaborator, Author) commented Aug 2, 2024

Yup, that fixes it; sorry, I could have saved myself some debugging time if I had read those comments more closely. Thanks!

@pstjohn closed this Aug 2, 2024
@pstjohn deleted the pstjohn/v2-main/masking-issue branch September 16, 2024 12:55