Auto vs. Model-specific classes and tokenizers

Hi!

I have a private model (a further pre-trained version of distilbert-base-uncased) that I want to use to predict the most likely tokens at a masked position in a given sequence of text (i.e. masked language modeling). I’m using the FillMaskPipeline for this. However, I noticed that I get different results when I load the model and the tokenizer with the Auto classes (AutoModelForMaskedLM and AutoTokenizer) than when I use the model-specific classes (DistilBertForMaskedLM and DistilBertTokenizer).

Why could this be the case? I would really appreciate any hints or ideas on how to think about this!

# imports needed for the snippet below
import numpy as np

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# load model and tokenizer (private_model_name and api_token are placeholders)
tokenizer = AutoTokenizer.from_pretrained(private_model_name, use_auth_token=api_token)
model = AutoModelForMaskedLM.from_pretrained(private_model_name, output_attentions=True, use_auth_token=api_token)

# create MLM pipeline
unmasker = FillMaskPipeline(model=model,
                            tokenizer=tokenizer,
                            device=-1,
                            top_k=5)

# example sentence
input_text = "as a leading firm in the [MASK] sector, we hire highly skilled software engineers."

# transform the input text into the appropriate format
test_text = input_text.replace("[MASK]", unmasker.tokenizer.mask_token)

# get predictions
output = unmasker(test_text)

# print results
for i, result in enumerate(output, start=1):
    print(f"{i}. {result['token_str']} (with probability: {np.round(result['score'], 3)})")