Auto vs. Model-specific classes and tokenizers

Hi!

I have a private model (a further pre-trained version of distilbert-base-uncased) that I want to use to predict the most likely tokens at a masked position in a given sequence of text (i.e. masked language modeling). I’m using the FillMaskPipeline for this. However, I noticed that I get different results when I load the model and the tokenizer with the Auto classes (AutoModelForMaskedLM and AutoTokenizer) than when I use the model-specific classes (DistilBertForMaskedLM and DistilBertTokenizer).

Why could this be the case? I would really appreciate any hints or ideas on how to think about this!

# imports needed for the snippet below
import numpy as np

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# load model and tokenizer (private_model_name and api_token are placeholders)
tokenizer = AutoTokenizer.from_pretrained(private_model_name, use_auth_token=api_token)
model = AutoModelForMaskedLM.from_pretrained(private_model_name, output_attentions=True, use_auth_token=api_token)

# create MLM pipeline
unmasker = FillMaskPipeline(model=model,
                            tokenizer=tokenizer,
                            device=-1,
                            top_k=5)

# example sentence
input_text = "as a leading firm in the [MASK] sector, we hire highly skilled software engineers."

# transform the input text into the appropriate format
test_text = input_text.replace("[MASK]", unmasker.tokenizer.mask_token)

# get predictions
output = unmasker(test_text)

# print results
for i, result in enumerate(output, start=1):
    print(f"{i}. {result['token_str']} (with probability: {np.round(result['score'], 3)})")