
Create Vocabulary from both pretrained transformers and instances #5368

Merged

Conversation

@amitkparekh (Contributor) commented on Aug 19, 2021

Closes #5355

Changes proposed in this pull request:

  • Adds a new constructor for creating a Vocabulary from both pretrained transformer models and instances from datasets (a usage sketch follows below).
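As a rough illustration, here is how the new constructor might be called. This is a hypothetical sketch: the registered name and the transformers parameter are taken from the config example quoted in the review below, while the instances argument and the exact signature are assumptions, not the merged code.

    from allennlp.data import Vocabulary

    # Hypothetical usage sketch (names assumed, see lead-in above):
    # count vocabulary items from dataset instances, then copy the
    # pretrained transformer's vocab verbatim into the "tokens" namespace.
    train_instances = ...  # an iterable of AllenNLP Instance objects
    vocab = Vocabulary.from_pretrained_transformer_and_instances(
        instances=train_instances,
        transformers={"tokens": "bert-base-cased"},
    )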

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.
  • codecov/patch reports high test coverage (at least 90%).
    You can find this under the "Actions" tab of the pull request once the other checks have finished.

@dirkgr (Member) left a comment

Looks good, by and large. I suspect some of the automated tests are going to complain about something.

The main question I have is: how is this supposed to work when multiple transformer vocabs are mapped into the same namespace?

    {
        type: 'from_pretrained_transformer_and_instances',
        transformers: {
            'namespace1': 'bert-base-cased',
            'namespace2': ['bert-base-cased', 'roberta-base'],
        },
    }
@dirkgr (Member) commented

What is supposed to happen when you put two transformers into the same namespace?

@amitkparekh (Contributor, Author) commented

If two models are put into the same namespace, that namespace is extended with the tokens from both models. I don't know why someone might want to do this, but there might be a research reason for it?

This is tested by both test_with_single_namespace_and_multiple_models and test_with_multiple_models_across_multiple_namespaces.
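In other words, the extension behavior described above amounts to something like this schematic (not the actual AllenNLP code; bert_vocab and roberta_vocab stand in for token-to-id mappings):

    # Schematic of "extending" one namespace with a second vocab:
    # word pieces missing from the first vocab are appended and get
    # freshly assigned ids.
    merged = dict(bert_vocab)
    next_id = len(merged)
    for token in roberta_vocab:
        if token not in merged:
            merged[token] = next_id
            next_id += 1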

@dirkgr (Member) commented

I think the result will be wrong if you do that. Each transformer expects a word piece to map to a certain integer. If a word piece maps to a different integer, the embeddings won't work. You'll probably get an "index out of bounds" exception (if you're lucky). Since we can't map two word pieces to the same integer (and we certainly can't map the same word piece to two different integers), I think we have to disallow taking in two transformer vocabs into the same namespace.
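The id mismatch is easy to see directly with the Hugging Face tokenizers (a standalone illustration, not part of this PR):

    from transformers import AutoTokenizer

    bert = AutoTokenizer.from_pretrained("bert-base-cased")
    roberta = AutoTokenizer.from_pretrained("roberta-base")

    # The same word piece maps to a different integer in each pretrained
    # vocabulary, so the two vocabs cannot share one namespace without
    # re-numbering ids and breaking at least one model's embeddings.
    print(bert.convert_tokens_to_ids("the"))
    print(roberta.convert_tokens_to_ids("the"))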

@amitkparekh (Contributor, Author) commented

That makes sense to me! I've updated the code to reflect those changes.
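To make the resolution concrete: with the update, each namespace maps to exactly one model name (a hypothetical sketch mirroring the earlier config example, not the merged diff):

    # Assumed final shape: one pretrained model per namespace; a list of
    # models for one namespace, as in the earlier example, is disallowed.
    vocab = Vocabulary.from_pretrained_transformer_and_instances(
        instances=train_instances,
        transformers={
            "namespace1": "bert-base-cased",
            "namespace2": "roberta-base",
        },
    )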

@dirkgr self-assigned this on Aug 23, 2021
@dirkgr (Member) left a comment

More tests than implementation code, I love it :-)

@dirkgr enabled auto-merge (squash) on Aug 24, 2021
@dirkgr merged commit 75af38e into allenai:main on Aug 24, 2021
@amitkparekh deleted the load-vocab-from-pretrained-and-instances branch on Aug 25, 2021