I’m trying to use BioBERT (downloaded from the Hugging Face model hub at dmis-lab/biobert-v1.1) to fill in [MASK] tokens in text, and I’m getting unexpected behavior from the suggested tokens.
I pasted a screenshot below comparing bert-base-uncased (which behaves as expected, with sensible most-likely tokens) to BioBERT:
Here’s the code to reproduce this:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

text = 'heart disease is [MASK] leading cause of death in the united states.'

# bert-base-uncased: print the top-10 predictions for the [MASK] position
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
tokenized = tokenizer(text, return_tensors='pt')
idx = tokenizer.convert_ids_to_tokens(tokenized.input_ids[0]).index(tokenizer.mask_token)
output = model(**tokenized, return_dict=True)
print(tokenizer.convert_ids_to_tokens(torch.topk(output.logits[0, idx, :], 10).indices))

# same thing with BioBERT
tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-v1.1')
model = AutoModelForMaskedLM.from_pretrained('dmis-lab/biobert-v1.1')
tokenized = tokenizer(text, return_tensors='pt')
idx = tokenizer.convert_ids_to_tokens(tokenized.input_ids[0]).index(tokenizer.mask_token)
output = model(**tokenized, return_dict=True)
print(tokenizer.convert_ids_to_tokens(torch.topk(output.logits[0, idx, :], 10).indices))
```
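One thing I tried in order to narrow this down: `from_pretrained` accepts an `output_loading_info=True` flag that reports which weights were missing from the checkpoint. If the `cls.predictions.*` keys (the MLM head) are missing, the head gets randomly initialized, which would explain nonsense predictions. The sketch below demonstrates the check with a tiny locally-saved BERT whose head weights are deliberately stripped out, so it runs without downloading anything; passing 'dmis-lab/biobert-v1.1' instead would test the real checkpoint.

```python
import tempfile
import torch
from transformers import BertConfig, BertForMaskedLM

# tiny config so this runs quickly and offline
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertForMaskedLM(config)

with tempfile.TemporaryDirectory() as d:
    # save a checkpoint *without* the MLM-head ("cls.*") weights,
    # mimicking a checkpoint exported from a plain BertModel
    config.save_pretrained(d)
    stripped = {k: v for k, v in model.state_dict().items()
                if not k.startswith('cls.')}
    torch.save(stripped, f'{d}/pytorch_model.bin')

    # reload and inspect which keys had to be freshly initialized
    _, info = BertForMaskedLM.from_pretrained(d, output_loading_info=True)

missing_head_keys = [k for k in info['missing_keys'] if k.startswith('cls.')]
print(missing_head_keys)  # non-empty -> the MLM head is randomly initialized
```

If the same check against the BioBERT checkpoint reports missing `cls.predictions` keys, that would confirm the head isn’t shipped with the weights.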
And here’s my output from running `transformers-cli env`:
- `transformers` version: 4.1.1
- Platform: macOS-10.11.6-x86_64-i386-64bit
- Python version: 3.8.5
- PyTorch version (GPU?): 1.4.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
I also asked about a similar issue with PubMedBERT in a GitHub issue a while back, but haven’t gotten a response.
Do the pretrained weights for these models simply not include the masked-language-modeling head needed to predict [MASK] tokens? If so, is there any way to work around this?