Where do classes get added as special tokens? #10

Closed
@NielsRogge

Hi,

I've implemented Donut in a fork of HuggingFace Transformers, and I'll add it to the library soon. The model is implemented as an instance of VisionEncoderDecoderModel, which allows combining any vision Transformer encoder (like ViT, Swin) with any text Transformer as decoder (like BERT, GPT-2, etc.). As Donut does exactly that, it was straightforward to implement it that way.

Here's a notebook that shows inference with it.

I do have 2 questions though:

  • I prepared a toy dataset of RVL-CDIP, in order to illustrate how to fine-tune the model on document image classification. However, I wonder where the different classes get added to the special tokens of the tokenizer + decoder. The toy dataset can be loaded as follows:
from datasets import load_dataset

dataset = load_dataset("nielsr/rvl_cdip_10_examples_per_class_donut")

When I use this dataset to create an instance of DonutDataset, it seems that only "<s_class>", "</s_class>" and "<s_rvlcdip>" are added as special tokens. But looking at this file, it seems that special tokens are also defined for each class. Looking at the code, it seems that only the keys of the dictionaries are added, not their values.

  • I've uploaded all weights to the hub; currently they are all hosted under my own name (nielsr). I wonder whether we can transfer them to the naver-clova-ix organization. Of course, the names are already taken for the PyPI package of this repository, so we can either use branches within the GitHub repos to specify a specific revision, or give priority to either HuggingFace Transformers or this PyPI package for the names.
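To make the first question concrete, here is a minimal sketch of the behaviour I think I'm seeing. It is not the repository's code: the helper names `collect_key_tokens` and `collect_value_tokens` are hypothetical, and the `<value/>` token format for classes is an assumption made only for illustration.

```python
from typing import Any, List


def collect_key_tokens(obj: Any) -> List[str]:
    """Return <s_key>/</s_key> tokens for every dict key; leaf values are ignored."""
    tokens: List[str] = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            tokens += [f"<s_{key}>", f"</s_{key}>"]
            tokens += collect_key_tokens(value)
    elif isinstance(obj, list):
        for item in obj:
            tokens += collect_key_tokens(item)
    return tokens


def collect_value_tokens(obj: Any) -> List[str]:
    """Return one token per distinct leaf value (assumed format: <value/>)."""
    if isinstance(obj, dict):
        leaves = [t for v in obj.values() for t in collect_value_tokens(v)]
    elif isinstance(obj, list):
        leaves = [t for v in obj for t in collect_value_tokens(v)]
    else:
        leaves = [f"<{obj}/>"]
    # De-duplicate while keeping first-seen order.
    return list(dict.fromkeys(leaves))


# Hypothetical ground truth for one RVL-CDIP example.
ground_truth = {"class": "letter"}

key_tokens = collect_key_tokens(ground_truth)      # only the key is covered
value_tokens = collect_value_tokens(ground_truth)  # the class name itself
```

With only the first helper, the class name "letter" would never become a special token, which matches what I observe. If the class tokens are meant to be added, I'd expect something like the second helper's output to be passed to the tokenizer (for example via `tokenizer.add_special_tokens({"additional_special_tokens": value_tokens})`, followed by resizing the decoder's token embeddings), but I couldn't find where that happens.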

Let me know what you think!

Kind regards,

Niels
ML Engineer @ HuggingFace
