Create Vocabulary from both pretrained transformers and instances #5368
Conversation
Looks good, by and large. I suspect some of the automated tests are going to complain about something.
The main question I have is, how is it supposed to work when multiple transformer vocabs are mapped into the same namespace?
allennlp/data/vocabulary.py (outdated diff):

    type: 'from_pretrained_transformer_and_instances',
    transformers: {
        'namespace1': 'bert-base-cased',
        'namespace2': ['bert-base-cased', 'roberta-base'],
    }
What is supposed to happen when you put two transformers into the same namespace?
If two models are put into the same namespace, that namespace is extended with the tokens from both models. I'm not sure why someone would want to do this, but there might be a research reason for it. This is covered by both test_with_single_namespace_and_multiple_models and test_with_multiple_models_across_multiple_namespaces.
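To make the behavior being described concrete, here is a minimal sketch (not the PR's actual implementation) of what "extending one namespace with the tokens of both models" amounts to, using the HuggingFace tokenizer vocabularies; the helper name is hypothetical.

```python
# Illustrative sketch only: merge the word-piece vocabularies of several pretrained
# models into a single AllenNLP namespace. The real code lives in
# allennlp/data/vocabulary.py; this helper is an assumption made for illustration.
from allennlp.data import Vocabulary
from transformers import AutoTokenizer

def extend_namespace_with_models(vocab: Vocabulary, namespace: str, model_names) -> None:
    for model_name in model_names:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        # get_vocab() maps each word piece to the id the pretrained model expects.
        for token in tokenizer.get_vocab():
            # add_token_to_namespace assigns the next free index in the namespace,
            # so the resulting ids need not match the ids the pretrained models expect.
            vocab.add_token_to_namespace(token, namespace)

vocab = Vocabulary()
extend_namespace_with_models(vocab, "namespace2", ["bert-base-cased", "roberta-base"])
print(vocab.get_vocab_size("namespace2"))
```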
I think the result will be wrong if you do that. Each transformer expects a given word piece to map to a specific integer; if a word piece ends up mapped to a different integer, the embeddings won't line up, and you'll probably get an "index out of bounds" exception (if you're lucky). Since we can't map two word pieces to the same integer, and we certainly can't map the same word piece to two different integers, I think we have to disallow mapping two transformer vocabs into the same namespace.
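A quick illustration of the conflict (the ids are printed rather than hard-coded, since the exact values depend on the pretrained vocabularies):

```python
# Sketch: the same word piece maps to different integers in different pretrained
# vocabularies, so one shared namespace cannot preserve both mappings.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-cased")
roberta = AutoTokenizer.from_pretrained("roberta-base")

piece = "the"
print(piece, bert.convert_tokens_to_ids(piece), roberta.convert_tokens_to_ids(piece))
# Whatever single id a merged namespace assigns to "the", at least one of the two
# embedding matrices would be indexed at the wrong row (or out of bounds).
```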
That makes sense to me! I've updated the code to reflect those changes.
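For readers following along, the guard presumably looks something like the sketch below; the exact error type, message, and function name are assumptions, not a quote of the final diff.

```python
# Hypothetical sketch of the validation described above: reject configurations that
# map more than one pretrained model to the same namespace.
from typing import Dict, List, Union
from allennlp.common.checks import ConfigurationError

def check_one_model_per_namespace(transformers: Dict[str, Union[str, List[str]]]) -> None:
    for namespace, model_names in transformers.items():
        if not isinstance(model_names, str) and len(model_names) > 1:
            raise ConfigurationError(
                f"Namespace '{namespace}' maps to multiple transformer models "
                f"({model_names}); each namespace may use at most one pretrained vocabulary."
            )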
More tests than implementation code, I love it :-)
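To round out the thread, here is a hedged, self-contained sketch of the usage the discussion converges on: one pretrained model per namespace, with the remaining namespaces built from instances. The constructor name mirrors the config `type` shown above; its exact signature in the merged code is an assumption, and the toy instance is only there so the example runs on its own.

```python
from allennlp.data import Instance, Vocabulary
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

# A toy instance so the example is self-contained; real instances would come from a
# dataset reader.
instances = [
    Instance(
        {
            "tokens": TextField(
                [Token("hello"), Token("world")],
                {"tokens": SingleIdTokenIndexer(namespace="source_tokens")},
            )
        }
    )
]

# Assumed constructor corresponding to the "from_pretrained_transformer_and_instances"
# config type: pretrained vocabularies fill the listed namespaces, instance counts
# fill the rest.
vocab = Vocabulary.from_pretrained_transformer_and_instances(
    instances=instances,
    transformers={
        "namespace1": "bert-base-cased",
        "namespace2": "roberta-base",
    },
)
print(vocab.get_vocab_size("namespace1"), vocab.get_vocab_size("source_tokens"))
```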
Closes #5355
Changes proposed in this pull request:
Before submitting
- I've read and followed the relevant section of the CONTRIBUTING docs.
- I've updated docstrings following the Writing docstrings section of the CONTRIBUTING docs.

After submitting
- codecov/patch reports high test coverage (at least 90%). You can find this under the "Actions" tab of the pull request once the other checks have finished.