Fine-tuning BERT with MRPC data, but the inference result is not as expected

Hello, I followed the complete steps in Fine-tuning a model with the Trainer API - Hugging Face LLM Course and got a trained model, but when I use the pipeline API with the following code, the result is not what I expect. Is there any suggestion about this? Thanks a lot.

def predict_with_trained_model():
    classifier = pipeline("text-classification", model="./test-trainer/checkpoint-1377", tokenizer="./test-trainer/checkpoint-1377")
    result = classifier(["I have a nice day today","the whether is god"])
    print("result is:{0}".format(result))

Is the pipeline being used incorrectly…?


You’re doing the mechanics correctly (you trained and loaded the checkpoint), but you’re asking the model to do a different task from what it was trained for, and you’re feeding it inputs in the wrong shape.

Below is a step-by-step explanation of:

  1. What your model actually learned.
  2. Why your current pipeline call gives “strange” results.
  3. How to use this MRPC model correctly.
  4. What to do instead if you actually want single-sentence sentiment/quality classification.
  5. A short checklist + references.

1. What your MRPC-fine-tuned BERT actually is

In the Hugging Face LLM course chapter you followed, they fine-tune BERT on MRPC (Microsoft Research Paraphrase Corpus). (Hugging Face)

Key facts:

  • MRPC is a paraphrase detection dataset:

    • Each example has:

      • sentence1
      • sentence2
      • label (0 = not a paraphrase, 1 = paraphrase)
    • Task: “Do these two sentences mean the same thing?”

  • In the course, they tokenize pairs of sentences:

    def tokenize_function(example):
        return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
    # from the LLM course MRPC processing section
    # https://huggingface.co/learn/llm-course/en/chapter3/2
    
  • They load a sequence classification model with 2 labels:

    from transformers import AutoModelForSequenceClassification
    
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )
    # from the Trainer fine-tuning example
    # https://huggingface.co/learn/llm-course/en/chapter3/3
    

So your checkpoint:

  • Input during training: sentence1 + sentence2 together.

  • Output labels: a 2-class prediction:

    • 0 → “not_equivalent” (not paraphrases)
    • 1 → “equivalent” (paraphrases) (Hugging Face)

It is not a model for:

  • “Is this sentence positive or negative?”
  • “Is this sentence grammatical?”
  • “Does this sentence sound nice?”

It is only trained to compare two sentences and say “same meaning” vs “different meaning”.
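
If you want to double-check those label semantics yourself, you can read them straight off the GLUE MRPC dataset (a minimal sketch, assuming the datasets library used in the course is installed):

from datasets import load_dataset

# Same dataset the course fine-tunes on
raw_datasets = load_dataset("glue", "mrpc")

# Each example is a sentence pair plus an integer label
print(raw_datasets["train"][0])

# The label feature carries the human-readable class names
print(raw_datasets["train"].features["label"].names)
# ['not_equivalent', 'equivalent']  ->  0 = not_equivalent, 1 = equivalent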


2. Why your current inference looks wrong

Your code:

def predict_with_trained_model():
    classifier = pipeline(
        "text-classification",
        model="./test-trainer/checkpoint-1377",
        tokenizer="./test-trainer/checkpoint-1377"
    )
    result = classifier(["I have a nice day today", "the whether is god"])
    print("result is:{0}".format(result))

2.1 Task mismatch

  • The MRPC model is a paraphrase classifier (sentence pair → “equivalent” / “not_equivalent”).
  • Your call is treating it like a single-sentence classifier (one sentence → some label).

So, conceptually, you trained it to answer:

“Given two sentences, do they mean the same thing?”

But you are now asking:

“Given this single sentence, is it good / bad / correct / something?”

The model will still output some probability over its 2 labels, but those labels do not correspond to sentiment or grammar. They still mean “paraphrase” vs “not paraphrase” – even if printed as LABEL_0 / LABEL_1. (Hugging Face)

2.2 Input shape mismatch: pairs vs single sentences

During training:

  • The tokenizer always saw two sentences together. Under the hood, BERT sees something like:

    [CLS] sentence1 tokens ... [SEP] sentence2 tokens ... [SEP]
    

    with segment IDs / token_type_ids separating the two parts. (Hugging Face)

At inference in your code:

  • classifier(["I have a nice day today", "the whether is god"]) is interpreted as a batch of two independent single sentences.
  • There is no sentence2 at all. No [SEP] separating two sentences, no pair structure.

The classifier head was trained on pair representations and now sees single-sentence representations. It still produces logits, but it is operating off-distribution, so scores are not meaningful in the way you expect.

The Hugging Face docs explain that to classify pairs, you must pass a dict with "text" and "text_pair" (or a list of such dicts). (Hugging Face)
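
You can see this difference directly by comparing what the tokenizer produces for a pair versus a single sentence (a small sketch using the tokenizer saved in your checkpoint directory):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./test-trainer/checkpoint-1377")

# Pair encoding: the shape the model saw during MRPC training
pair = tokenizer("I have a nice day today", "The weather is good today")
print(tokenizer.decode(pair["input_ids"]))
# [CLS] i have a nice day today [SEP] the weather is good today [SEP]
print(pair["token_type_ids"])  # 0s for sentence1, 1s for sentence2

# Single-sentence encoding: what your pipeline call actually produced
single = tokenizer("I have a nice day today")
print(tokenizer.decode(single["input_ids"]))
# [CLS] i have a nice day today [SEP]  -> no second segment at all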

2.3 Label naming: LABEL_0 vs real labels

Unless you set them yourself, the config will typically hold:

  • id2label = {0: "LABEL_0", 1: "LABEL_1"}
  • label2id = {"LABEL_0": 0, "LABEL_1": 1}

But for MRPC, the dataset semantics are:

  • 0 → "not_equivalent" (the two sentences are not paraphrases)
  • 1 → "equivalent" (the two sentences are paraphrases)

If you mentally read:

  • LABEL_0 as “negative sentiment”
  • LABEL_1 as “positive sentiment”

you will misinterpret the predictions. They aren’t sentiment labels at all.
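
You can confirm what your checkpoint actually stores with a quick look at its config (a small sketch against your checkpoint directory):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("./test-trainer/checkpoint-1377")
print(config.id2label)
# Typically {0: 'LABEL_0', 1: 'LABEL_1'} unless you set it during training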

2.4 Domain mismatch

MRPC sentences are drawn from online news articles: relatively formal, well-formed, complete sentences.

Your test sentences:

"I have a nice day today"
"the whether is god"  # typos: "weather" and "good"

are short, casual, and include spelling errors. Even as a paraphrase task:

  • “Do these two sentences mean the same thing?” → they are not clearly paraphrases.

So you have a domain mismatch on top of everything else. The model has never seen text like this in training, so its confidence values become even less interpretable.


3. How to use your MRPC model correctly

If your goal is:

“I fine-tuned on MRPC and I want to classify whether two sentences are paraphrases.”

then you should:

  1. Feed pairs into the pipeline.
  2. Interpret labels as equivalent / not_equivalent.
  3. Optionally rename labels in the config so they are human-readable.

3.1 Proper use of pipeline for sentence pairs

The text-classification pipeline supports text pairs using {"text", "text_pair"}: (Hugging Face)

from transformers import pipeline

def predict_paraphrase():
    model_dir = "./test-trainer/checkpoint-1377"

    classifier = pipeline(
        "text-classification",
        model=model_dir,
        tokenizer=model_dir,
    )

    examples = [
        {
            "text": "The company Hugging Face is based in New York City.",
            "text_pair": "Hugging Face is headquartered in NYC.",
        },
        {
            "text": "I have a nice day today.",
            "text_pair": "The weather is good today.",
        },
    ]

    preds = classifier(examples)
    print(preds)

if __name__ == "__main__":
    predict_paraphrase()
    # If id2label is still default, you'll see LABEL_0 / LABEL_1
    # LABEL_1 == paraphrase (equivalent), LABEL_0 == not equivalent (for MRPC)
    # MRPC description: https://huggingface.co/learn/llm-course/en/chapter3/2

Internally, the pipeline will:

  • Build a correct pair input ([CLS] text [SEP] text_pair [SEP]).
  • Run the MRPC head on top.

Now the model is being used for the same type of input it saw during training, and its logits align with its training objective.

3.2 Rename labels for clarity

You can store better label names directly in the model config:

from transformers import AutoModelForSequenceClassification

model_dir = "./test-trainer/checkpoint-1377"

model = AutoModelForSequenceClassification.from_pretrained(model_dir)

model.config.id2label = {0: "not_equivalent", 1: "equivalent"}
model.config.label2id = {"not_equivalent": 0, "equivalent": 1}

model.save_pretrained(model_dir)
# Next time you create a pipeline from this folder, you'll see those labels.
# label mapping pattern: https://huggingface.co/docs/transformers/en/tasks/sequence_classification

After this, running the same pipeline will return:

[{'label': 'equivalent', 'score': 0.93}, ...]

instead of LABEL_1, which is clearer.

3.3 Optional: use a dedicated MRPC “pair-classification” pipeline

Hugging Face provides an example MRPC model sgugger/finetuned-bert-mrpc which is set up for a custom “pair-classification” pipeline used in older docs and examples. (Hugging Face)

The underlying idea is the same: treat inputs explicitly as sentence pairs and output paraphrase / non-paraphrase. Your own model can be used in essentially the same way with the general text-classification pipeline + text / text_pair inputs, which is simpler and more standard now.
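
If you want a reference point, you can run the same text / text_pair inputs through that public MRPC model and compare its predictions with your own checkpoint (a sketch; the public model may also report generic LABEL_0 / LABEL_1 names unless its config maps them):

from transformers import pipeline

# Public BERT model fine-tuned on GLUE MRPC (model card linked in the references)
reference = pipeline("text-classification", model="sgugger/finetuned-bert-mrpc")

print(reference({
    "text": "The company Hugging Face is based in New York City.",
    "text_pair": "Hugging Face is headquartered in NYC.",
}))
# Run the same pair through your own checkpoint and compare the two outputs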


4. What if you actually want sentiment / “good vs bad” on single sentences?

Many people follow the MRPC tutorial and then expect:

classifier(["I have a nice day today", "the weather is bad"])

to give something like POSITIVE / NEGATIVE. That will not happen with an MRPC model, because MRPC is not a sentiment dataset.

If your real goal is:

“Given a single sentence, tell me whether it’s positive/negative (or good/bad).”

then you need a single-sentence sentiment model, not MRPC.

4.1 Use a sentiment model that’s already trained

You can simply load a sentiment model fine-tuned on IMDb or SST-2 (Hugging Face has many, and the text-classification guide shows how to fine-tune one). (Hugging Face)

Example:

from transformers import pipeline

# Example sentiment model from HF hub (replace with one you like)
# See: https://huggingface.co/models?pipeline_tag=text-classification
sentiment = pipeline("sentiment-analysis")  # default is usually a sentiment model

result = sentiment(["I have a nice day today", "the weather is terrible"])
print(result)
# => [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]

These models are trained on single sentences or reviews, with labels like POSITIVE / NEGATIVE. The task and input format match your expectations.
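
If you prefer to pin an explicit model rather than rely on the pipeline default (usually distilbert-base-uncased-finetuned-sst-2-english, an SST-2 sentiment model), you can name it directly; treat the exact model choice as an example:

from transformers import pipeline

sentiment = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(sentiment(["I have a nice day today", "the weather is terrible"]))
# => [{'label': 'POSITIVE', ...}, {'label': 'NEGATIVE', ...}]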

4.2 Fine-tune your own sentiment model (if you want custom behavior)

If you need your own domain-specific sentiment classification:

  1. Pick a single-sentence classification dataset:

    • IMDb, Yelp, Amazon reviews (for positive/negative).
    • Or your own labeled data.
  2. Follow the standard Text classification fine-tuning tutorial:
     https://huggingface.co/docs/transformers/en/tasks/sequence_classification

  3. Set id2label / label2id to sentiment labels, e.g.:

    id2label = {0: "NEGATIVE", 1: "POSITIVE"}
    label2id = {"NEGATIVE": 0, "POSITIVE": 1}
    
  4. After fine-tuning, you can safely do:

    sentiment = pipeline(
        "text-classification",
        model="./my-sentiment-checkpoint",
        tokenizer="./my-sentiment-checkpoint",
    )
    sentiment(["I have a nice day today", "the weather is bad"])
    

Now the model has:

  • The right objective (sentiment).
  • The right input shape (single sentences).
  • Meaningful labels (POSITIVE/NEGATIVE).
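
Putting steps 1–4 together, the model-loading part of such a run might look like this (a sketch following the Text classification task guide; the dataset preparation and Trainer loop are omitted):

from transformers import AutoModelForSequenceClassification

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
# Fine-tune this with Trainer on a single-sentence dataset (e.g. IMDb),
# and the resulting pipeline will report POSITIVE / NEGATIVE directly.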

5. Extra gotchas and debugging tips

These are smaller issues, but good to be aware of:

5.1 Check which checkpoint you’re loading

In the course, they use TrainingArguments("test-trainer", ...), which saves multiple checkpoints (e.g. checkpoint-500, checkpoint-1000, etc.). (Hugging Face)

  • ./test-trainer/checkpoint-1377 is just one of them.
  • Sometimes you want the “best” checkpoint (lowest validation loss), not necessarily the last one.
  • You can enable load_best_model_at_end=True in TrainingArguments for future runs.
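
For future runs, something along these lines keeps the best checkpoint (a sketch; on recent transformers versions the argument is spelled eval_strategy instead of evaluation_strategy):

from transformers import TrainingArguments

training_args = TrainingArguments(
    "test-trainer",
    evaluation_strategy="epoch",   # evaluate once per epoch
    save_strategy="epoch",         # must match the evaluation strategy
    load_best_model_at_end=True,   # reload the best checkpoint after training
    metric_for_best_model="f1",    # assumes compute_metrics reports an "f1" key
)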

5.2 Verify that training actually worked

Before using the model “in the wild”, test it on MRPC validation data:

  • Use trainer.evaluate() / trainer.predict() on the validation split. (Hugging Face)
  • Check that accuracy/F1 is reasonable (around 0.8+ on MRPC is typical for BERT). (Hugging Face)

If metrics are near chance (about 0.5 for random guessing, or stuck at the majority-class baseline of roughly 0.68 for MRPC), then training itself has a problem.
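
A quick way to run that check from the saved checkpoint, without retraining, is to rebuild a Trainer around it and evaluate on the MRPC validation split (a sketch, assuming the evaluate library and the same tokenization as in the course):

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_dir = "./test-trainer/checkpoint-1377"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

raw_datasets = load_dataset("glue", "mrpc")
tokenized = raw_datasets.map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True),
    batched=True,
)

metric = evaluate.load("glue", "mrpc")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments("eval-only"),
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,               # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
print(trainer.evaluate())
# Expect accuracy/F1 well above chance if training worked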

5.3 Be aware of text-pair pitfalls in pipelines

Older versions of the text-classification pipeline sometimes handled text pairs awkwardly if you passed them as plain lists/tuples. Issues and discussions show the recommended patterns and edge cases: (GitHub)

  • Correct usage is to pass {"text": s1, "text_pair": s2} or lists of such dicts, exactly as in the docs. (Hugging Face)

6. Summary

Restating the core ideas in one place:

  • Your fine-tuned checkpoint is an MRPC paraphrase classifier, not a general “sentence quality” or “sentiment” model. (Hugging Face)

  • MRPC training uses sentence pairs (sentence1, sentence2) as input; at inference you must also give pairs via {"text", "text_pair"}.

  • The labels are really “not_equivalent” and “equivalent”, even if printed as LABEL_0 and LABEL_1. You can rename them in the config for clarity.

  • Passing single, unrelated sentences (like "I have a nice day today", "the whether is god") to this model is asking it to solve a task it never learned; outputs will not match “positive/negative” expectations.

  • If you want single-sentence sentiment or correctness judgments, either:

    • Use an existing sentiment model, or
    • Fine-tune a model specifically for sentiment/quality, following the Text classification tutorial.

7. Curated references and guides

Grouped so you can dive deeper where needed:

A. MRPC and the LLM course

  • Processing MRPC data (definition of MRPC, sentence pairs, labels):
    Hugging Face LLM course – “Processing the data” (MRPC). (Hugging Face)
  • Fine-tuning with Trainer on MRPC:
    Hugging Face LLM course – “Fine-tuning a model with the Trainer API” (BERT on MRPC). (Hugging Face)

B. Pipelines and text pairs

  • Pipelines documentation (text pairs via text / text_pair):
    Transformers pipeline docs (shows using dicts for text pairs). (Hugging Face)
  • HF forum Q&A: using pipelines for text-pair classification (NLI):
    Shows that you must pass {"text": ..., "text_pair": ...}. (Hugging Face Forums)
  • GitHub issue about pipeline + text pairs:
    Discussion of how naive tuple/list inputs can misbehave and why the dict style is preferred. (GitHub)

C. Example MRPC model and custom pair-classification pipeline

  • sgugger/finetuned-bert-mrpc model card:
    Example of a BERT model fine-tuned on GLUE MRPC, with evaluation results. (Hugging Face)
  • Discussion about the custom pair-classification pipeline:
    Issue discussing the custom pipeline pattern used with sgugger/finetuned-bert-mrpc. (GitHub)

D. General text classification / sentiment

  • Text classification task guide (IMDb, DistilBERT, sentiment, id2label/label2id): (Hugging Face)
  • General BERT overview (what BERT can do, including sentiment): (LearnOpenCV)

If you adjust your code to (a) feed sentence pairs, and (b) interpret labels as paraphrase vs non-paraphrase, your MRPC model’s behavior will start to make sense. If you instead want single-sentence sentiment, you should switch to a sentiment-fine-tuned model or dataset.