Hi,
I’m trying to fine-tune a French BERT, i.e. CamembertForSequenceClassification, on a sequence regression task.
I simply load the model and tokenizer, passing num_labels=1 to run a regression head (my labels are actually either 0 or 1; I tried both a 2-label classification setup and a 1-label regression setup, with the same issue):
import transformers

model = transformers.CamembertForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=1
)
tokenizer = transformers.AutoTokenizer.from_pretrained("camembert-base")

def tokenize_bert(texts):
    # Pad the batch, truncate to at most 512 tokens, and return PyTorch tensors
    tokenizer_kwargs = {
        "truncation": True,
        "max_length": 512,
        "padding": True,
        "return_tensors": "pt",
    }
    return tokenizer(texts, **tokenizer_kwargs)
But at inference time, all output logits are exactly the same, and I figured out that the encoder outputs the exact same representation for any input sequence:
- I checked that the tokenized sequences are not the same.
- I checked that the embeddings are actually different.
- But then I noticed that the encoder output is exactly the same for every input (see the sketch below for how I compared them).
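
Roughly how I compared the intermediate outputs, reusing the model and tokenize_bert defined above (the two example sentences are just placeholders):

import torch

texts = ["Le chat dort sur le canapé.", "Une phrase complètement différente."]
enc = tokenize_bert(texts)

with torch.no_grad():
    # The token ids clearly differ between the two sentences
    print(enc["input_ids"])

    # The embedding layer also produces different vectors
    emb = model.roberta.embeddings(enc["input_ids"])
    print(torch.allclose(emb[0], emb[1]))  # False, as expected

    # ...but the encoder output is identical for both inputs
    hidden = model.roberta(**enc).last_hidden_state
    print(torch.allclose(hidden[0], hidden[1]))  # True: this is the problem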
Any idea how I could fix my training (more than 50k steps, with mostly default parameters) to make it work?
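
For completeness, this is roughly what the training loop looksks like (a sketch: train_dataset is a placeholder for my actual dataset, and everything not shown is left at the Trainer defaults):

from transformers import Trainer, TrainingArguments

# train_dataset is a placeholder: it yields input_ids / attention_mask / labels,
# with float labels so the single-logit head is trained with the MSE regression loss
args = TrainingArguments(
    output_dir="camembert-regression",
    max_steps=50_000,
    # everything else left at the defaults
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()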