I have trained a masked language model on clinical data using RoBERTa. The model is stored on the Hugging Face Hub at tdobrxl/ClinicBERT.
Locally, the trained model predicts a masked word quite well with the following code:
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ClinicBERT", tokenizer="ClinicBERT")
fill_mask(text)
For example:
text = "Decitabine Combined With Oxaliplatin in <mask> With Advanced Renal Cell Carcinoma"
fill_mask(text)
[{'score': 0.9417486190795898, 'token': 593, 'token_str': ' Patients', 'sequence': 'Decitabine Combined With Oxaliplatin in Patients With Advanced Renal Cell Carcinoma'}, {'score': 0.024623002856969833, 'token': 943, 'token_str': ' Subjects', 'sequence': 'Decitabine Combined With Oxaliplatin in Subjects With Advanced Renal Cell Carcinoma'}, {'score': 0.0076624322682619095, 'token': 2488, 'token_str': ' Participants', 'sequence': 'Decitabine Combined With Oxaliplatin in Participants With Advanced Renal Cell Carcinoma'}, {'score': 0.0044851102866232395, 'token': 3380, 'token_str': ' Children', 'sequence': 'Decitabine Combined With Oxaliplatin in Children With Advanced Renal Cell Carcinoma'}, {'score': 0.003453735029324889, 'token': 4756, 'token_str': ' Adults', 'sequence': 'Decitabine Combined With Oxaliplatin in Adults With Advanced Renal Cell Carcinoma'}]
However, when I load the model from my Hub repository (i.e., tdobrxl/ClinicBERT), the predictions look random:
fill_mask = pipeline("fill-mask", model="tdobrxl/ClinicBERT", tokenizer="tdobrxl/ClinicBERT")
text = "Decitabine Combined With Oxaliplatin in <mask> With Advanced Renal Cell Carcinoma"
fill_mask(text)
[{'score': 0.00024121406022459269, 'token': 13994, 'token_str': 'oproxil', 'sequence': 'Decitabine Combined With Oxaliplatin inoproxil With Advanced Renal Cell Carcinoma'}, {'score': 0.00023664682521484792, 'token': 15167, 'token_str': 'enecid', 'sequence': 'Decitabine Combined With Oxaliplatin inenecid With Advanced Renal Cell Carcinoma'}, {'score': 0.000197140165255405, 'token': 18398, 'token_str': ' enlarged', 'sequence': 'Decitabine Combined With Oxaliplatin in enlarged With Advanced Renal Cell Carcinoma'}, {'score': 0.0001816125150071457, 'token': 4308, 'token_str': ' edema', 'sequence': 'Decitabine Combined With Oxaliplatin in edema With Advanced Renal Cell Carcinoma'}, {'score': 0.0001676036190474406, 'token': 23309, 'token_str': 'ucal', 'sequence': 'Decitabine Combined With Oxaliplatin inucal With Advanced Renal Cell Carcinoma'}]
I noticed that when the model is loaded from the repo, the following warning appears:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
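In transformers, this warning is normally printed after a message listing weights that were not found in the checkpoint and were therefore newly (randomly) initialized, which would fit the near-uniform scores above. A minimal sketch for checking which weights are affected, assuming the repo can be loaded with AutoModelForMaskedLM; the helper names summarize_loading_info and load_with_report are hypothetical and only for illustration:

```python
from typing import Any, Dict, List


def summarize_loading_info(info: Dict[str, Any]) -> Dict[str, List[str]]:
    """Collect the weight names that did not load cleanly from a checkpoint."""
    return {
        # Expected by the model but absent from the checkpoint: randomly initialized.
        "missing": sorted(info.get("missing_keys", [])),
        # Present in the checkpoint but not used by this model class.
        "unexpected": sorted(info.get("unexpected_keys", [])),
        # Found under the same name but with a different shape.
        "mismatched": sorted(str(k) for k in info.get("mismatched_keys", [])),
    }


def load_with_report(repo_id: str):
    """Load a masked-LM checkpoint and report which weights failed to load."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForMaskedLM

    # output_loading_info=True makes from_pretrained also return a dict of
    # missing, unexpected, and mismatched keys.
    model, info = AutoModelForMaskedLM.from_pretrained(
        repo_id, output_loading_info=True
    )
    return model, summarize_loading_info(info)
```

Calling load_with_report("tdobrxl/ClinicBERT") and again on the local "ClinicBERT" directory, then comparing the two reports, should show whether the files uploaded to the repo differ from the local ones. If the "missing" list is non-empty only for the Hub copy, the uploaded checkpoint is likely incomplete or was saved from a different model class.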
How can I load the model from the repo so that it predicts correctly?