Load pretrained model's tokenizer with or without vocabulary?

I am having some confusion about loading a pre-trained tokenizer for a model. I’m trying to use someone’s model from their repo; they have the following saved:

  • checkpoints of their fine-tuned BERT uncased model
  • vocabulary files (multiple), probably all generated after training a BertWordPieceTokenizer

Now, if I want to fine-tune the model on my data, how do I load the pre-trained tokenizer and use it on my dataset? Do I use the vocabulary files, or should I load the tokenizer the same way I load the model, as I have seen in an example here:

Please guide me. Thanks

hi @Sabs101
Please check Fine-tuning a model with the Trainer API - Hugging Face NLP Course.

The most important part is writing the tokenize_function, if you need one:

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
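
For reference, wiring that into the course example looks roughly like this (a minimal sketch assuming the GLUE MRPC dataset used in the course; swap in your own checkpoint, dataset, and column names):

from datasets import load_dataset
from transformers import AutoTokenizer

# load the tokenizer the same way you load the model checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# the course example uses GLUE MRPC; replace this with your own dataset
raw_datasets = load_dataset("glue", "mrpc")

# apply tokenize_function to every split, in batches
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)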

We can help more if you share your initial model (probably it’s bert-base-uncased) and some lines from your dataset.

Hi @mahmutc,

Thank you for sharing the link with me, but the confusion still persists. I want to know how I can load my (pre-trained) tokenizer to use it on my own dataset. Should I load it the same way I load the model, or, since a vocab file is present with the model, can I do .from_pretrained(‘vocab.txt’) to load my tokenizer?
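
In code, I mean the difference between these two options (paths are placeholders, not the actual file names from the repo linked below):

from transformers import BertTokenizer

# Option 1: build the tokenizer directly from the saved vocabulary file
tokenizer = BertTokenizer(vocab_file="path/to/vocab.txt")

# Option 2: load it from the checkpoint directory, the same way as the model
tokenizer = BertTokenizer.from_pretrained("path/to/checkpoint_dir")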

The model I am using is a fine-tuned version of the original ‘bert-base-uncased’, adapted for a different language. Here is what I have from the model:


Attaching a link to it as well, in case it helps: Roman_Urdu_BERT/roman_urdu at master · usamakh20/Roman_Urdu_BERT · GitHub