What's the difference between bart-base tokenizer and bart-large tokenizer


from transformers import BartTokenizer

tokenizer1 = BartTokenizer.from_pretrained('facebook/bart-base')
tokenizer2 = BartTokenizer.from_pretrained('facebook/bart-large')

What’s the difference conceptually? I can understand the difference between the uncased and cased tokenizers for BERT, but why this?
By the way, bart-base and bart-large have the same “vocab_size”: 50265 in their configs.


It is obviously related to the larger number of parameters used in bart-large, as mentioned in the description:
facebook/bart-large 24-layer, 1024-hidden, 16-heads, 406M parameters
facebook/bart-base 12-layer, 768-hidden, 16-heads, 139M parameters

Thanks for the reply. But why would a tokenizer depend on the number of model parameters? Isn’t it just responsible for tokenizing the text of the corpus, independent of the model’s size?


Easy there with the “obviously”. This isn’t obvious, because as @zuujhyt rightfully says, the number of parameters is typically not directly related to the vocab. I.e., the vocabulary (and hence the embedding indices) often does not change between small/large models; instead, the model’s blocks get wider and/or deeper. I think this is a good question.
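A quick back-of-the-envelope sketch of this point, using the vocab size and hidden sizes quoted above: the embedding matrix differs between base and large only because the hidden dimension differs, not because the tokenizer or vocabulary does.

```python
# Same vocab for bart-base and bart-large (from their configs).
vocab_size = 50265

# Hidden sizes from the model descriptions above.
base_hidden, large_hidden = 768, 1024

# Embedding-matrix parameters scale with hidden size, not with the tokenizer.
base_embed = vocab_size * base_hidden    # 38,603,520 params
large_embed = vocab_size * large_hidden  # 51,471,360 params

print(base_embed, large_embed)
```

So the parameter gap in the embedding layer comes entirely from the model width; the token-to-id mapping stays the same.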

cc @patrickvonplaten


Agreed, I’d like to know more about this too.

Those tokenizers are identical. You can check it by just comparing the files over at https://huggingface.co/facebook/bart-base/tree/main and https://huggingface.co/facebook/bart-large/tree/main

Incidentally, they’re also the same as the ones for roberta-* models.

We duplicate tokenizers into their model repos for ease of use (a model id is all you need).


I understand, thanks!