Okay, I tried that code you suggested, but I get this error:
OSError: Can't load tokenizer for '/content/my_model'. Make sure that:
- '/content/my_model' is a correct model identifier listed on 'https://huggingface.co/models'
- or '/content/my_model' is the correct path to a directory containing relevant tokenizer files
Hey @anon58275033, it's true that the data collator uses a tokenizer to perform the collation, but you need to provide the tokenizer argument explicitly to the trainer:

"The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs to the maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model."
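In practice that means wiring the tokenizer into the `Trainer` yourself. A minimal sketch (the `model`, `training_args`, and `train_dataset` names here are placeholders for whatever you already have):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,                  # your model
    args=training_args,           # your TrainingArguments
    train_dataset=train_dataset,  # your tokenized dataset
    tokenizer=tokenizer,          # pass it explicitly so it is saved with the model
)
trainer.train()
```

With `tokenizer=` set, saving the model also writes the tokenizer files into the output directory, which is exactly what the `OSError` above says is missing from `/content/my_model`.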
# Let's see how to increase the vocabulary of Bert model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added_toks = tokenizer.add_tokens(['🥵', '👏'])
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expects the full size of the new vocabulary, i.e. the length of the tokenizer
The error:
NameError Traceback (most recent call last)
<ipython-input-27-203dc3e7172a> in <module>()
1 # Let's see how to increase the vocabulary of Bert model and tokenizer
----> 2 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
3 model = BertModel.from_pretrained('bert-base-uncased')
4
5 num_added_toks = tokenizer.add_tokens(['🥵', '👏'])
NameError: name 'BertTokenizer' is not defined