BERT (CamemBERT) for Sequence Classification maps any sequence to the exact same encoding

Hi,

I’m trying to fine-tune a sequence regression model using BERT for French, i.e. CamembertForSequenceClassification.

I simply load the model and tokenizer, passing num_labels=1 to run a regression task (my labels are actually either 0 or 1; I tried both a classification task with 2 labels and a regression task with 1 label, with the same issue):

import transformers

# num_labels=1 -> a single regression output
model = transformers.CamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=1)
tokenizer = transformers.AutoTokenizer.from_pretrained("camembert-base")

def tokenize_bert(texts):
    tokenizer_kwargs = {
        "truncation": True,
        "max_length": 512,
        "padding": True,
        "return_tensors": "pt",
    }
    return tokenizer(
        texts,
        **tokenizer_kwargs,
    )
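
For example (the two sentences here are just placeholders), the helper returns a padded batch of tensors:

batch = tokenize_bert(["Ceci est une phrase.", "Une autre phrase, un peu plus longue."])
print(batch["input_ids"].shape)       # torch.Size([2, length of the longer sequence])
print(batch["attention_mask"].shape)  # same shape, with 0s over the padding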

But at inference time all output logits are exactly the same, and I figured out that the encoder outputs the exact same representation for any sequence (the snippet after the list below shows roughly how I check each point):

  1. I check that the tokenized sequences are not the same.

  2. I check that the embeddings are actually different.

  3. But then I notice that the encoder output is exactly the same for all inputs.
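
Here is roughly how I run these checks (a minimal sketch; the two example sentences are placeholders, and I go through the model internals, model.roberta, to get at the embeddings and the encoder output):

import torch

texts = ["Le film était excellent.", "Quel horrible service, je suis très déçu."]
batch = tokenize_bert(texts)

model.eval()
with torch.no_grad():
    # 1. The token ids differ between the two sequences.
    print(batch["input_ids"])

    # 2. The embedding outputs differ as well.
    embeddings = model.roberta.embeddings(batch["input_ids"])
    print(torch.allclose(embeddings[0], embeddings[1]))  # False, as expected

    # 3. But the encoder output (last hidden state) is identical for both inputs...
    hidden = model.roberta(**batch).last_hidden_state
    print(torch.allclose(hidden[0], hidden[1]))  # True -- this is the problem

    # ...so the classification head produces identical logits too.
    print(model(**batch).logits)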

Any idea how I could fix my training (more than 50k steps, with mostly default parameters) to make it work?
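
For completeness, this is roughly the training setup (a sketch: train_texts, train_labels, the RegressionDataset wrapper, and the batch size are placeholders; apart from max_steps I keep the Trainer defaults, e.g. learning rate 5e-5 with AdamW):

import torch
from transformers import Trainer, TrainingArguments

# Placeholder dataset wrapper: pairs the tokenized texts with float labels
# (float targets + num_labels=1 make the model use an MSE regression loss).
class RegressionDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenize_bert(texts)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

train_dataset = RegressionDataset(train_texts, train_labels)  # placeholders

training_args = TrainingArguments(
    output_dir="camembert-regression",
    max_steps=50_000,               # the ~50k steps mentioned above
    per_device_train_batch_size=8,  # placeholder; everything else left at the defaults
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()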