
Create Vocabulary from both pretrained transformers and instances #5368

Merged

Conversation

@amitkparekh (Contributor) commented on Aug 19, 2021

Closes #5355

Changes proposed in this pull request:

  • Adds a new constructor for creating a Vocabulary from both pretrained transformer models and instances from datasets (a usage sketch follows below).
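As a rough illustration, here is how the new constructor might be called. This is a hypothetical sketch: the registered name and the transformers parameter are taken from the config example quoted in the review below, while the instances argument and the exact signature are assumptions, not the merged code.

    from allennlp.data import Vocabulary

    # Hypothetical usage sketch (names assumed, see lead-in above):
    # count vocabulary items from dataset instances, then copy the
    # pretrained transformer's vocab verbatim into the "tokens" namespace.
    train_instances = ...  # an iterable of AllenNLP Instance objects
    vocab = Vocabulary.from_pretrained_transformer_and_instances(
        instances=train_instances,
        transformers={"tokens": "bert-base-cased"},
    )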

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.
  • codecov/patch reports high test coverage (at least 90%).
    You can find this under the "Actions" tab of the pull request once the other checks have finished.

@dirkgr (Member) left a comment

Looks good, by and large. I suspect some of the automated tests are going to complain about something.

The main question I have is: how is this supposed to work when multiple transformer vocabs are mapped into the same namespace?

    {
        type: 'from_pretrained_transformer_and_instances',
        transformers: {
            'namespace1': 'bert-base-cased',
            'namespace2': ['bert-base-cased', 'roberta-base'],
        },
    }
@dirkgr (Member) commented

What is supposed to happen when you put two transformers into the same namespace?

@amitkparekh (Contributor, Author) commented

If two models are put into the same namespace, that namespace is extended with the tokens from both models. I don't know why someone might want to do this, but there might be a research reason for it?

This is tested by both test_with_single_namespace_and_multiple_models and test_with_multiple_models_across_multiple_namespaces.
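In other words, the extension behavior described above amounts to something like this schematic (not the actual AllenNLP code; bert_vocab and roberta_vocab stand in for token-to-id mappings):

    # Schematic of "extending" one namespace with a second vocab:
    # word pieces missing from the first vocab are appended and get
    # freshly assigned ids.
    merged = dict(bert_vocab)
    next_id = len(merged)
    for token in roberta_vocab:
        if token not in merged:
            merged[token] = next_id
            next_id += 1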

@dirkgr (Member) commented

I think the result will be wrong if you do that. Each transformer expects a word piece to map to a certain integer. If a word piece maps to a different integer, the embeddings won't work. You'll probably get an "index out of bounds" exception (if you're lucky). Since we can't map two word pieces to the same integer (and we certainly can't map the same word piece to two different integers), I think we have to disallow taking in two transformer vocabs into the same namespace.
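The id mismatch is easy to see directly with the Hugging Face tokenizers (a standalone illustration, not part of this PR):

    from transformers import AutoTokenizer

    bert = AutoTokenizer.from_pretrained("bert-base-cased")
    roberta = AutoTokenizer.from_pretrained("roberta-base")

    # The same word piece maps to a different integer in each pretrained
    # vocabulary, so the two vocabs cannot share one namespace without
    # re-numbering ids and breaking at least one model's embeddings.
    print(bert.convert_tokens_to_ids("the"))
    print(roberta.convert_tokens_to_ids("the"))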

@amitkparekh (Contributor, Author) commented

That makes sense to me! I've updated the code to reflect those changes.
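To make the resolution concrete: with the update, each namespace maps to exactly one model name (a hypothetical sketch mirroring the earlier config example, not the merged diff):

    # Assumed final shape: one pretrained model per namespace; a list of
    # models for one namespace, as in the earlier example, is disallowed.
    vocab = Vocabulary.from_pretrained_transformer_and_instances(
        instances=train_instances,
        transformers={
            "namespace1": "bert-base-cased",
            "namespace2": "roberta-base",
        },
    )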

@dirkgr self-assigned this on Aug 23, 2021
@dirkgr (Member) left a comment

More tests than implementation code, I love it :-)

@dirkgr enabled auto-merge (squash) on Aug 24, 2021
@dirkgr merged commit 75af38e into allenai:main on Aug 24, 2021
@amitkparekh deleted the load-vocab-from-pretrained-and-instances branch on Aug 25, 2021