Loading model from repository does not return expected result

tdobrxl · July 28, 2022, 3:57pm

I have trained a masked language model using RoBERTa on clinical data. The model is stored on the Hugging Face Hub as tdobrxl/ClinicBERT.

Locally, the trained model works pretty well for predicting a masked word with the following code:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ClinicBERT", tokenizer="ClinicBERT")
fill_mask(text)

For example:

text = "Decitabine Combined With Oxaliplatin in <mask> With Advanced Renal Cell Carcinoma"

fill_mask(text)
[{'score': 0.9417486190795898, 'token': 593, 'token_str': ' Patients', 'sequence': 'Decitabine Combined With Oxaliplatin in Patients With Advanced Renal Cell Carcinoma'}, {'score': 0.024623002856969833, 'token': 943, 'token_str': ' Subjects', 'sequence': 'Decitabine Combined With Oxaliplatin in Subjects With Advanced Renal Cell Carcinoma'}, {'score': 0.0076624322682619095, 'token': 2488, 'token_str': ' Participants', 'sequence': 'Decitabine Combined With Oxaliplatin in Participants With Advanced Renal Cell Carcinoma'}, {'score': 0.0044851102866232395, 'token': 3380, 'token_str': ' Children', 'sequence': 'Decitabine Combined With Oxaliplatin in Children With Advanced Renal Cell Carcinoma'}, {'score': 0.003453735029324889, 'token': 4756, 'token_str': ' Adults', 'sequence': 'Decitabine Combined With Oxaliplatin in Adults With Advanced Renal Cell Carcinoma'}]

However, when I load the model from my repository (i.e., tdobrxl/ClinicBERT), its predictions look random.

fill_mask = pipeline("fill-mask", model="tdobrxl/ClinicBERT", tokenizer="tdobrxl/ClinicBERT")
text = "Decitabine Combined With Oxaliplatin in <mask> With Advanced Renal Cell Carcinoma"
fill_mask(text)

[{'score': 0.00024121406022459269, 'token': 13994, 'token_str': 'oproxil', 'sequence': 'Decitabine Combined With Oxaliplatin inoproxil With Advanced Renal Cell Carcinoma'}, {'score': 0.00023664682521484792, 'token': 15167, 'token_str': 'enecid', 'sequence': 'Decitabine Combined With Oxaliplatin inenecid With Advanced Renal Cell Carcinoma'}, {'score': 0.000197140165255405, 'token': 18398, 'token_str': ' enlarged', 'sequence': 'Decitabine Combined With Oxaliplatin in enlarged With Advanced Renal Cell Carcinoma'}, {'score': 0.0001816125150071457, 'token': 4308, 'token_str': ' edema', 'sequence': 'Decitabine Combined With Oxaliplatin in edema With Advanced Renal Cell Carcinoma'}, {'score': 0.0001676036190474406, 'token': 23309, 'token_str': 'ucal', 'sequence': 'Decitabine Combined With Oxaliplatin inucal With Advanced Renal Cell Carcinoma'}]

I noticed that loading the model from the repo shows this warning:

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

How can I load the model from the repo and get correct predictions?

tdobrxl · July 29, 2022, 10:31pm

It turned out that when pushing to the Hub, I had used RobertaModel instead of RobertaForMaskedLM.
Pushing the model with the right class solved the issue.
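
For readers hitting the same warning: a checkpoint saved from RobertaModel holds only the encoder, so when the fill-mask pipeline loads it, the masked-LM head weights missing from the checkpoint are freshly initialized and the predictions come out close to random. The sketch below is one way to check for and fix this, not necessarily the exact steps taken here. It assumes that `save_pretrained` recorded the saving class under `architectures` in `config.json`, that the local `ClinicBERT` directory still contains the trained head weights, and that you are already logged in to the Hub. The helper names `saved_architectures` and `republish_with_mlm_head` are hypothetical.

```python
import json
from pathlib import Path


def saved_architectures(checkpoint_dir):
    """Return the `architectures` list recorded in a checkpoint's config.json.

    Returns an empty list if the field is absent.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("architectures") or []


def republish_with_mlm_head(checkpoint_dir, repo_id):
    """Reload a local checkpoint with its masked-LM head and push it to the Hub."""
    # Imported lazily so the config check above works without transformers installed.
    from transformers import AutoTokenizer, RobertaForMaskedLM

    # RobertaForMaskedLM keeps the lm_head weights that RobertaModel would drop.
    model = RobertaForMaskedLM.from_pretrained(checkpoint_dir)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)

    # Both calls need network access and a valid Hub login.
    model.push_to_hub(repo_id)
    tokenizer.push_to_hub(repo_id)
```

On a local copy of the repository files, `saved_architectures(...)` listing `RobertaModel` rather than `RobertaForMaskedLM` would point to this problem. In that case, something like `republish_with_mlm_head("ClinicBERT", "tdobrxl/ClinicBERT")` should overwrite the upload with a checkpoint that includes the head.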
