How to use SciBERT with run_mlm.py? (Is it possible?)

Hi there,

I’m trying to further train the scibert_scivocab_uncased model using the run_mlm.py script. I’ve had no issues further training BERT_base and RoBERTa, but I’m a bit stuck with SciBERT.

SciBERT is not one of the basic models you can directly call from run_mlm.py, so I downloaded the model from allenai/scibert_scivocab_uncased (main branch)

to run:
python myrun_mlm.py --model_name_or_path=scibert_scivocab_uncased

But the tokenizer files (tokenizer.json, tokenizer_config.json, …) are missing, so it’s not working. I can’t find the tokenizer files in the AllenAI SciBERT git repo either.

What am I missing there?
Thanks for the help! :hugs:
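
A quick way to see what the downloaded folder actually contains is sketched below. This is only a diagnostic under stated assumptions: the helper name tokenizer_files_present is hypothetical, and the list of file names is my guess at what is commonly shipped with BERT-style checkpoints. As far as I know, a vocab.txt next to config.json is usually enough for the BERT tokenizer to load, while tokenizer.json and tokenizer_config.json are optional, so their absence alone may not be the real cause of the failure.

```python
from pathlib import Path

# Assumption: file names commonly shipped with BERT-style checkpoints.
TOKENIZER_FILES = ("vocab.txt", "tokenizer.json", "tokenizer_config.json")


def tokenizer_files_present(model_dir):
    """Map each candidate tokenizer file name to whether it exists in model_dir."""
    folder = Path(model_dir)
    return {name: (folder / name).is_file() for name in TOKENIZER_FILES}


# Local folder name taken from the command in the post.
for name, found in tokenizer_files_present("scibert_scivocab_uncased").items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If vocab.txt shows up as missing, the download was probably incomplete, and that would be the first thing to fix.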

Alternative solution: instead of using the run_mlm.py script, use code inspired by lordtt13/COVID-SciBERT on Hugging Face and word-embeddings/COVID-SciBERT.ipynb in the lordtt13/word-embeddings repo on GitHub. Thanks @lordtt13 !!

with:
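import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
# AutoModelWithLMHead is deprecated; AutoModelForMaskedLM is the masked-LM replacement
model = transformers.AutoModelForMaskedLM.from_pretrained('allenai/scibert_scivocab_uncased').to('cuda')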

It would still be interesting to know whether integrating SciBERT as a pre-trained model is planned for run_mlm.py.
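
Since the Hub ID loads fine with from_pretrained in the snippet above, it may be worth passing that same ID to the script instead of a local folder. Below is a minimal sketch of building such a command, assuming run_mlm.py resolves --model_name_or_path through from_pretrained in the same way. The helper build_run_mlm_command, the training file train.txt, and the output directory name are all hypothetical placeholders, and I have not run this against SciBERT myself.

```python
import shlex


def build_run_mlm_command(model_id, train_file, output_dir, script="run_mlm.py"):
    """Return the argument list for a masked-LM training run."""
    return [
        "python", script,
        "--model_name_or_path", model_id,  # Hub ID rather than a local folder
        "--train_file", train_file,        # hypothetical plain-text corpus
        "--do_train",
        "--output_dir", output_dir,        # hypothetical output location
    ]


cmd = build_run_mlm_command(
    "allenai/scibert_scivocab_uncased",
    "train.txt",
    "scibert-further-trained",
)

# Print a shell-ready command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value separate, which avoids the quoting problems that come up when the command is pieced together from string fragments.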