Description
Hi,
I've implemented Donut as a fork of HuggingFace Transformers, and I'll soon add it to the library. The model is implemented as an instance of VisionEncoderDecoderModel, which makes it possible to combine any vision Transformer encoder (like ViT or Swin) with any text Transformer as decoder (like BERT, GPT-2, etc.). As Donut does exactly that, it was straightforward to implement it this way.
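To illustrate, here's a minimal sketch of how VisionEncoderDecoderModel lets you mix and match an encoder and a decoder (the checkpoint names below are just illustrative, not the actual Donut weights):

```python
from transformers import VisionEncoderDecoderModel

# Combine a Swin encoder with a BART decoder; cross-attention layers are
# added to the decoder automatically. Checkpoint names are illustrative only.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224", "facebook/bart-base"
)
```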
Here's a notebook that shows inference with it.
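For a quick idea of what inference looks like without opening the notebook, here's a rough sketch (the processor API and the checkpoint name are assumptions on my side and may still change before this is merged):

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Illustrative checkpoint name; the weights currently live under my namespace.
processor = DonutProcessor.from_pretrained("nielsr/donut-base-finetuned-rvlcdip")
model = VisionEncoderDecoderModel.from_pretrained("nielsr/donut-base-finetuned-rvlcdip")

image = Image.open("document.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is prompted with a task-specific start token.
decoder_input_ids = processor.tokenizer(
    "<s_rvlcdip>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=32)
# The decoded sequence contains the predicted class token, e.g. "<advertisement/>".
print(processor.batch_decode(outputs)[0])
```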
I do have 2 questions though:
- I prepared a toy subset of RVL-CDIP to illustrate how to fine-tune the model on document image classification. However, I wonder where the different classes get added as special tokens to the tokenizer + decoder (see the sketch after this list). The toy dataset can be loaded as follows:
  ```python
  from datasets import load_dataset

  dataset = load_dataset("nielsr/rvl_cdip_10_examples_per_class_donut")
  ```
  When using this dataset to create an instance of DonutDataset, it seems only "<s_class>", "</s_class>" and "<s_rvlcdip>" are added as special tokens. But looking at this file, it seems one also defines a special token for each class. Looking at the code, it seems only the keys of the dictionaries are added as special tokens, not their values.
- I've uploaded all weights to the hub; currently they are all hosted under my own name (nielsr). I wonder whether we can transfer them to the naver-clova-ix organization. Of course, the names are already taken by the PyPI package of this repository, so either we can use branches within the GitHub repos to specify a specific revision, or we can give priority to one of the two (HuggingFace Transformers or this PyPI package) for the names.
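Regarding the first question: if the per-class tokens indeed need to be added manually, this is a minimal sketch of what I'd expect (the class list and the "<advertisement/>"-style token format are assumptions based on the values in the ground truth JSON, and the tokenizer checkpoint is illustrative):

```python
from transformers import XLMRobertaTokenizer

# Illustrative base tokenizer; Donut uses an XLM-RoBERTa tokenizer under the hood.
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Hypothetical: one special token per RVL-CDIP class, mirroring the
# "<advertisement/>"-style values found in the ground truth JSON (subset shown).
class_tokens = ["<advertisement/>", "<budget/>", "<email/>", "<form/>", "<letter/>"]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": class_tokens})

# The decoder's embedding matrix then needs to grow accordingly, e.g.:
# model.decoder.resize_token_embeddings(len(tokenizer))
```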
Let me know what you think!
Kind regards,
Niels
ML Engineer @ HuggingFace