I fine-tuned MT5ForSequenceClassification on a dataset as a regression task. However, after fine-tuning, the saved model’s config says the architecture is MT5ForConditionalGeneration. Is that expected, or is something wrong? Why did it change?

Also, I don’t know how to do inference with my saved model. The paper I am trying to replicate says “We adapt mT5 into a regression-metric by applying a RMSE loss between the logits of a special classification token and the label which is either 0 or 1. During inference, we then force-decode the classification token and extract its probability.”

But I’m not sure how to do the last sentence.

My eval loss is 0.2576, which isn’t bad, so I should be able to get reasonable results with inference but I haven’t been able to make it work.

Any help would be appreciated. I am using this training script with these args:

    --model_name_or_path ${MODEL_SCRATCH} \
    --train_file ${DATA_SCRATCH}/train_stata.csv \
    --validation_file ${DATA_SCRATCH}/dev_stata.csv \
    --test_file ${DATA_SCRATCH}/test_stata.csv \
    --do_regression True \
    --metric_name rmse \
    --text_column_name "linearized_input,output" \
    --label_column_name attributable \
    --do_train \
    --do_eval \
    --do_predict \
    --max_seq_length 2048 \
    --per_device_train_batch_size 8 \
    --learning_rate 1e-4 \
    --lr_scheduler_type constant \
    --ignore_mismatched_sizes True \
    --num_train_epochs 1 \
    --output_dir ${OUTPUT_DIR} \
    --overwrite_output_dir True \
    --gradient_checkpointing True \
    --gradient_accumulation_steps 1 \
    --eval_accumulation_steps 1 \
    --text_column_delimiter "[output]" \
    --save_total_limit 1 \
    --load_best_model_at_end True

And here is some of the output logs:

[INFO|] 2024-01-07 22:03:10,696 >> loading weights file /disk/scratch/s2029717/mt5-large-seq/model.safetensors
[INFO|] 2024-01-07 22:03:17,330 >> All model checkpoint weights were used when initializing MT5ForSequenceClassification.

[WARNING|] 2024-01-07 22:03:17,330 >> Some weights of MT5ForSequenceClassification were not initialized from the model checkpoint at /disk/scratch/s2029717/mt5-large-seq and are newly initialized because the shapes did not match:
- classification_head.out_proj.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([1]) in the model instantiated
- classification_head.out_proj.weight: found shape torch.Size([2, 1024]) in the checkpoint and torch.Size([1, 1024]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Why am I getting this warning when I am loading a ForSequenceClassification model?

You can see the model files here.


It seems like the model you saved is an MT5ForConditionalGeneration model as seen here. Hence if you load it using MT5ForSequenceClassification.from_pretrained it will tell you that not all the weights are initialized from the checkpoint.

Regarding the paper, that would require adding a special classification token to the vocabulary of the model and tokenizer, and making sure the final hidden state of that token is turned into logits. It could look something like:

# in the init of the model:
self.classifier = nn.Linear(hidden_size, num_labels)


# in the forward of the model:
# extract final hidden state of the special classification token, assuming it's the first one
# this gives us a tensor of shape (batch_size, hidden_size)
cls_features = final_hidden_states[:,0,:]

# apply classifier to turn `cls_features` into `logits` of shape (batch_size, num_labels)
logits = self.classifier(cls_features)
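
To make the fragments above concrete, here is a minimal, self-contained sketch of the same idea together with an RMSE loss like the one the paper describes. Several things here are assumptions for illustration only: `ClsTokenRegressor` and `rmse_loss` are hypothetical names, a small `nn.Embedding` stands in for the real mT5 encoder, and the classification token is assumed to sit at position 0. The paper's actual setup may well differ.

```python
import torch
from torch import nn


class ClsTokenRegressor(nn.Module):
    """Toy model: stand-in encoder plus a linear head on the first token."""

    def __init__(self, vocab_size: int, hidden_size: int, num_labels: int = 1):
        super().__init__()
        # Stand-in for the real encoder (assumption): maps token ids to vectors.
        self.encoder = nn.Embedding(vocab_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        final_hidden_states = self.encoder(input_ids)  # (batch, seq_len, hidden)
        cls_features = final_hidden_states[:, 0, :]    # (batch, hidden)
        return self.classifier(cls_features)           # (batch, num_labels)


def rmse_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Root-mean-square error between the single logit and the 0/1 label."""
    return torch.sqrt(nn.functional.mse_loss(logits.squeeze(-1), labels))


torch.manual_seed(0)
model = ClsTokenRegressor(vocab_size=100, hidden_size=16)
input_ids = torch.randint(0, 100, (4, 10))  # batch of 4 sequences, 10 tokens each
labels = torch.tensor([0.0, 1.0, 1.0, 0.0])

logits = model(input_ids)         # one score per example, shape (4, 1)
loss = rmse_loss(logits, labels)  # scalar training loss
loss.backward()                   # gradients reach the head and the stand-in encoder
```

With `num_labels=1` the head produces a single value per example, which is why a regression checkpoint has a classification head of shape `[1, hidden_size]` rather than `[2, hidden_size]`.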

Hi @nielsr, thanks for the reply

It seems like the model you saved is an MT5ForConditionalGeneration model

Yeah I know, but during training I loaded a pretrained MT5ForSequenceClassification model, so what I’m confused about is why would it save my finetuned model as MT5ForConditionalGeneration?

This is from the fine-tuning logs:

[INFO|] 2024-01-07 22:03:06,753 >> loading configuration file /disk/scratch/s2029717/mt5-large-seq/config.json
[INFO|] 2024-01-07 22:03:06,817 >> Model config MT5Config {
  "_name_or_path": "/disk/scratch/s2029717/mt5-large-seq",
  "architectures": [
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "finetuning_task": "text-classification",
  "id2label": {
    "0": "LABEL_0"
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "label2id": {
    "LABEL_0": 0
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.37.0.dev0",
  "use_cache": true,
  "vocab_size": 250112

@nielsr also, I still don’t know how to do inference with the fine-tuned model.

I tried the code here, but it predicts a class. Since I am doing regression, I want to extract the probability of my single class instead.
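
For a regression head with a single label there is no class to pick: the model returns one logit per example, and that value is the prediction. Below is a minimal sketch, assuming the checkpoint loads as a sequence-classification model with `num_labels=1`. The names `predict_scores`, `clamp_unit`, and `model_dir` are hypothetical, and clamping to [0, 1] is an assumption about how to read the score as a probability, not something the paper specifies.

```python
from typing import List


def clamp_unit(values: List[float]) -> List[float]:
    """Clamp raw regression outputs to [0, 1] (an assumption, see above)."""
    return [min(1.0, max(0.0, v)) for v in values]


def predict_scores(model_dir: str, texts: List[str], max_length: int = 2048) -> List[float]:
    """Return one regression score per input text."""
    # Imported lazily so the helper above is usable without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=max_length)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (batch_size, 1) when num_labels == 1

    # No argmax here: the single logit is the regression output itself.
    return clamp_unit(logits[:, 0].tolist())
```

The input texts should be built the same way the training script built them from the two text columns, otherwise the scores will not be comparable to the eval numbers.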

Bit embarrassing: the error was a mistake in the script I used to upload the model to the Hub. It was using AutoModelForSeq2SeqLM instead of AutoModelForSequenceClassification.
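
For anyone hitting the same problem, here is a rough sketch of an upload script that guards against it. It checks the `architectures` field in the saved `config.json` before loading, then loads with the sequence-classification Auto class. The function names, `model_dir`, and `repo_id` are hypothetical, and this is not the script that was actually used.

```python
import json
from pathlib import Path
from typing import List


def saved_architectures(model_dir: str) -> List[str]:
    """Read the `architectures` field from a saved model's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("architectures", [])


def upload_classifier(model_dir: str, repo_id: str) -> None:
    """Push a fine-tuned sequence-classification checkpoint to the Hub."""
    archs = saved_architectures(model_dir)
    if not any(name.endswith("ForSequenceClassification") for name in archs):
        raise ValueError(f"Unexpected architectures in {model_dir}: {archs}")

    # Lazy import so the config check works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Loading with the matching Auto class keeps the classification head.
    # A seq2seq Auto class would build (and re-save) a generation model instead.
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model.push_to_hub(repo_id)
    tokenizer.push_to_hub(repo_id)
```

Running `saved_architectures` on the local output directory and on a fresh download from the Hub is a quick way to confirm that both report the same class.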

