Fine-tuning a locally saved model on an NER task

A few days ago I further pre-trained the nlpaueb/legal-bert-base-uncased (nlpaueb/legal-bert-base-uncased · Hugging Face) model on a masked token prediction task with a custom dataset using run_mlm.py. After training finished, I saved the model to a local directory. I can see pytorch_model.bin, config.json, and all the other required files in this directory.

Now I want to fine-tune this saved model on an NER task using run_ner.py (from transformers 4.6.0) with another custom training dataset, using the following command:

python3 run_ner.py --model_name_or_path path-to-the-saved-model --train_file path-to-the-json-training-data --validation_file path-to-the-json-validation-data --do_train t --do_eval t --output_dir path-to-output-dir --overwrite_output_dir t --label_all_tokens t --return_entity_level_metrics t &> log-file.txt

The log file shows the following error:

Some weights of BertForTokenClassification were not initialized from the model checkpoint at /home/ubuntu/trained-model/legalbert_further_trained/ and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
07/13/2021 18:17:44 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/json/default-e94e893514cb3fcb/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02/cache-361f604016d1c665.arrow
07/13/2021 18:17:44 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/json/default-e94e893514cb3fcb/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02/cache-a77e5011b7db5bd5.arrow
07/13/2021 18:17:44 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/json/default-e94e893514cb3fcb/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02/cache-5659054e40c94504.arrow
07/13/2021 18:17:44 - WARNING - datasets.load - Using the latest cached version of the module from /home/ubuntu/.cache/huggingface/modules/datasets_modules/metrics/seqeval/1fde2544ac1f3f7e54c639c73221d3a5e5377d2213b9b0fdb579b96980b84b2e (last modified on Fri Jul 2 16:39:09 2021) since it couldn't be found locally at seqeval/seqeval.py or remotely (ConnectionError).
[INFO|trainer.py:1047] 2021-07-13 18:17:48,569 >> Loading model from /home/ubuntu/trained-model/legalbert_further_trained/).
Traceback (most recent call last):
  File "run_ner.py", line 533, in <module>
    main()
  File "run_ner.py", line 478, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/transformers/trainer.py", line 1066, in train
    self._load_state_dict_in_model(state_dict)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/transformers/trainer.py", line 1387, in _load_state_dict_in_model
    if set(load_result.missing_keys) == set(self.model._keys_to_ignore_on_save):
TypeError: 'NoneType' object is not iterable

Note that if I replace [--model_name_or_path path-to-the-saved-model] with [--model_name_or_path nlpaueb/legal-bert-base-uncased] (or any other existing language model that I haven't further pre-trained), run_ner.py runs without any issue.
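
For what it's worth, one way to check whether the saved checkpoint itself is at fault is to load the directory directly as a token-classification model, outside of run_ner.py. This is only a sketch; num_labels is a placeholder and should match the number of NER tags in the custom dataset:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_path = "/home/ubuntu/trained-model/legalbert_further_trained/"  # the locally saved MLM checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
# num_labels is a placeholder; use the number of labels in the NER dataset.
model = AutoModelForTokenClassification.from_pretrained(model_path, num_labels=9)

# The warning about classifier.bias / classifier.weight being newly initialized is expected here,
# since the MLM checkpoint has no token-classification head yet.
print(type(model).__name__, model.num_parameters())
```

If this load works, the checkpoint files themselves are fine and the failure is happening later, inside the Trainer.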

Hey, I am in a similar situation: I am trying to fine-tune a BERT model that was saved locally after adding new tokens to its tokenizer vocabulary, and I am getting the same error you describe.
Was anyone able to identify the root cause of this issue?

In case anyone else is facing a similar issue: I found that I was adding new tokens to a BERT model loaded with AutoModel and then trying to fine-tune it with AutoModelForSequenceClassification. The fix was to load the BERT model as a sequence-classification model in the first place, i.e. use AutoModelForSequenceClassification already when adding the new tokens to the tokenizer.
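
A rough sketch of that workflow (the base model name, number of labels, and tokens below are illustrative, not from my actual setup):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

base_model = "bert-base-uncased"        # illustrative; use whatever model you start from
save_dir = "path-to-the-saved-model"    # local directory to save the adjusted model to

tokenizer = AutoTokenizer.from_pretrained(base_model)
# Load with the same head class you will later fine-tune with, not with plain AutoModel.
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Add the new domain-specific tokens (placeholders here) and keep the
# embedding matrix in sync with the enlarged vocabulary.
new_tokens = ["newtoken1", "newtoken2"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained(save_dir)
model.save_pretrained(save_dir)
```

The saved directory can then be passed to the fine-tuning script via --model_name_or_path as usual.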