Strange output using BioBERT for imputing MASK tokens

I’m trying to use BioBERT (downloaded from the HuggingFace models repository at dmis-lab/biobert-v1.1) to fill in MASK tokens in text, and I’m getting some unexpected behavior with the suggested tokens.

I pasted a screenshot below comparing bert-base-uncased (which behaves as expected, with sensible most-likely tokens) against BioBERT:

Here’s the code to reproduce this:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

text = 'heart disease is [MASK] leading cause of death in the united states.'

def top_mask_tokens(model_name, k=10):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    tokenized = tokenizer(text, return_tensors='pt')
    # Find the position of the [MASK] token in the tokenized input
    idx = tokenizer.convert_ids_to_tokens(tokenized.input_ids[0]).index(tokenizer.mask_token)
    output = model(**tokenized, return_dict=True)
    # Return the k most likely tokens at the masked position
    return tokenizer.convert_ids_to_tokens(torch.topk(output.logits[0, idx, :], k).indices)

print(top_mask_tokens('bert-base-uncased'))
print(top_mask_tokens('dmis-lab/biobert-v1.1'))
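As an aside for anyone reproducing this: the fill-mask pipeline wraps the same tokenize/forward/top-k steps with less boilerplate. A sketch (I've only run the explicit version above):

from transformers import pipeline

# The pipeline finds the [MASK] position and reports the most likely
# fill-ins along with their scores (top 5 by default)
unmasker = pipeline('fill-mask', model='dmis-lab/biobert-v1.1')
print(unmasker('heart disease is [MASK] leading cause of death in the united states.'))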

And here’s my output from running transformers-cli env:

- `transformers` version: 4.1.1
- Platform: macOS-10.11.6-x86_64-i386-64bit
- Python version: 3.8.5
- PyTorch version (GPU?): 1.4.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

I also asked about a similar issue with PubMedBERT in a GitHub issue a while back, but haven't gotten a response.

Do the pretrained weights for these models not contain the components necessary for doing masked language modeling/imputing MASK tokens? Is there any way to fix this issue?

Hi,

I am not an expert, but that is what it looks like to me.

Masked language modelling is mainly used during pre-training and is often not needed for fine-tuning, so I guess the DMIS team didn't think the MLM parameters would be required.

I notice that the DMIS team have provided 5 models. Do any of the other models have MLM parameters? One way to check is sketched below.
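from_pretrained can report which parameters were missing from a checkpoint and had to be randomly initialized; for a checkpoint without an MLM head, that's exactly the head weights, which would explain the garbage predictions. A sketch (substitute the other model IDs from the dmis-lab Hub page):

from transformers import AutoModelForMaskedLM

# Add the other dmis-lab model IDs to this list to check them too
for name in ['dmis-lab/biobert-v1.1']:
    _, info = AutoModelForMaskedLM.from_pretrained(name, output_loading_info=True)
    # Parameters listed here were absent from the checkpoint and were
    # randomly initialized (e.g. the cls.predictions.* MLM head weights)
    print(name, info['missing_keys'])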

It should certainly be possible to copy the DMIS weights into a model of your own that does include an MLM head. I expect you would then need to train that model before it gives sensible answers, unless you can find a suitable MLM head to copy (probably not…). A minimal sketch of the idea follows.
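This assumes BioBERT v1.1 shares bert-base-cased's architecture and vocabulary (it was initialized from that checkpoint), so the tensor shapes line up:

from transformers import BertForMaskedLM, BertModel

# Start from a checkpoint that does ship a trained MLM head...
mlm = BertForMaskedLM.from_pretrained('bert-base-cased')
# ...and swap in BioBERT's encoder
mlm.bert = BertModel.from_pretrained('dmis-lab/biobert-v1.1')
# Re-tie the output embeddings to the (new) input embeddings
mlm.tie_weights()
# The borrowed head was trained against bert-base-cased's encoder, so
# you would still need to fine-tune with the MLM objective before the
# predictions become sensible.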

The reason you are not getting a response is that this is nearly impossible to debug: these are third-party models that someone else trained. It is possible that they never trained/fine-tuned these models on MLM, in which case the model doesn't know what to output for the mask.

You should try to get in touch with the model creators to get an answer.