Saving fine-tuned MT5ForSequenceClassification


I fine-tuned MT5ForSequenceClassification on a dataset as a regression task. However, after fine-tuning, the saved model’s config lists the architecture as MT5ForConditionalGeneration. Is that expected, or is something wrong? Why did it change?

Also, I don’t know how to do inference with my saved model. The paper I am trying to replicate says “We adapt mT5 into a regression-metric by applying a RMSE loss between the logits of a special classification token and the label which is either 0 or 1. During inference, we then force-decode the classification token and extract its probability.”

But I’m not sure how to do the last sentence.

My eval loss is 0.2576, which isn’t bad, so I should be able to get reasonable results with inference but I haven’t been able to make it work.

Any help would be appreciated. I am using this training script with these args:

    --model_name_or_path ${MODEL_SCRATCH} \
    --train_file ${DATA_SCRATCH}/train_stata.csv \
    --validation_file ${DATA_SCRATCH}/dev_stata.csv \
    --test_file ${DATA_SCRATCH}/test_stata.csv \
    --do_regression True \
    --metric_name rmse \
    --text_column_name "linearized_input,output" \
    --label_column_name attributable \
    --do_train \
    --do_eval \
    --do_predict \
    --max_seq_length 2048 \
    --per_device_train_batch_size 8 \
    --learning_rate 1e-4 \
    --lr_scheduler_type constant \
    --ignore_mismatched_sizes True \
    --num_train_epochs 1 \
    --output_dir ${OUTPUT_DIR} \
    --overwrite_output_dir True \
    --gradient_checkpointing True \
    --gradient_accumulation_steps 1 \
    --eval_accumulation_steps 1 \
    --text_column_delimiter "[output]" \
    --save_total_limit 1 \
    --load_best_model_at_end True

And here are some of the output logs:

[INFO|] 2024-01-07 22:03:10,696 >> loading weights file /disk/scratch/s2029717/mt5-large-seq/model.safetensors
[INFO|] 2024-01-07 22:03:17,330 >> All model checkpoint weights were used when initializing MT5ForSequenceClassification.

[WARNING|] 2024-01-07 22:03:17,330 >> Some weights of MT5ForSequenceClassification were not initialized from the model checkpoint at /disk/scratch/s2029717/mt5-large-seq and are newly initialized because the shapes did not match:
- classification_head.out_proj.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([1]) in the model instantiated
- classification_head.out_proj.weight: found shape torch.Size([2, 1024]) in the checkpoint and torch.Size([1, 1024]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Why am I getting this warning when I am loading a ForSequenceClassification model?

You can see the model files here.


It seems like the model you saved is an MT5ForConditionalGeneration model, as seen here. Hence, if you load it with MT5ForSequenceClassification.from_pretrained, it will warn you that not all the weights were initialized from the checkpoint.
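
A quick way to confirm which class a checkpoint was saved with is to read the "architectures" field of its config.json. Below is a minimal sketch using only the standard library. The helper names saved_architectures and is_sequence_classifier are made up for illustration, and model_dir is assumed to be a local directory written by save_pretrained:

```python
import json
from pathlib import Path


def saved_architectures(model_dir):
    """Return the `architectures` list recorded in a checkpoint's config.json."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # The field can be missing or null, so fall back to an empty list.
    return config.get("architectures") or []


def is_sequence_classifier(model_dir):
    """True if any recorded architecture is a ...ForSequenceClassification class."""
    return any(
        name.endswith("ForSequenceClassification")
        for name in saved_architectures(model_dir)
    )
```

If is_sequence_classifier returns False for a checkpoint you expected to be a classifier, the step that saved or uploaded the model is worth checking first.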

Regarding the paper, that would require adding a special classification token to the vocabulary of the model and tokenizer, and making sure the final hidden state of that token is turned into logits. It could look something like:

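import torch.nn as nn

# `ClsTokenHead` is an illustrative name; the same two pieces can live
# directly in the init and forward of your own model class.
class ClsTokenHead(nn.Module):
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        # in the init of the model:
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, final_hidden_states):
        # in the forward of the model:
        # extract final hidden state of the special classification token, assuming it's the first one
        # this gives us a tensor of shape (batch_size, hidden_size)
        cls_features = final_hidden_states[:, 0, :]

        # apply classifier to turn `cls_features` into `logits` of shape (batch_size, num_labels)
        return self.classifier(cls_features)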
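
As for the "force-decode the classification token and extract its probability" part: one reading (my interpretation, not confirmed by the paper) is that you run a single decoder step, take the logits over the vocabulary for that step, apply a softmax, and read off the probability assigned to the classification token. The sketch below shows only that last step in plain Python. first_step_logits stands for the vocabulary logits of the first decoder position, and cls_token_id is a hypothetical id for the special token; both would come from your own model and tokenizer.

```python
import math


def forced_token_probability(first_step_logits, cls_token_id):
    """Softmax probability of one token, given the logits of one decoder step."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(first_step_logits)
    exps = [math.exp(x - max_logit) for x in first_step_logits]
    return exps[cls_token_id] / sum(exps)
```

Note that this applies to a model with a language-modeling head over the vocabulary. A model with a separate regression head outputs its score directly and does not need this step.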

Hi @nielsr, thanks for the reply

It seems like the model you saved is an MT5ForConditionalGeneration model

Yeah, I know, but during training I loaded a pretrained MT5ForSequenceClassification model, so what I’m confused about is why it would save my fine-tuned model as MT5ForConditionalGeneration.

This is from the fine-tuning logs:

[INFO|] 2024-01-07 22:03:06,753 >> loading configuration file /disk/scratch/s2029717/mt5-large-seq/config.json
[INFO|] 2024-01-07 22:03:06,817 >> Model config MT5Config {
  "_name_or_path": "/disk/scratch/s2029717/mt5-large-seq",
  "architectures": [
    ...
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "finetuning_task": "text-classification",
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.37.0.dev0",
  "use_cache": true,
  "vocab_size": 250112
}

@nielsr also, I still don’t know how to do inference with the fine-tuned model.

I tried the code here, but it predicts a class. Since I am doing regression, I want to extract the probability of my single class instead.

A bit embarrassing: the error was a mistake in the script I used to upload the model to the Hub. It was using AutoModelForSeq2SeqLM instead of AutoModelForSequenceClassification.
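
For anyone landing here with the same inference question: once the checkpoint is saved under the sequence-classification class, regression inference could look roughly like the sketch below. It assumes the model was trained with a single output (num_labels = 1, matching the torch.Size([1]) head in the logs above), that model_dir points at the fine-tuned checkpoint, and that each input string is built the same way as during training. predict_scores and clip_to_unit_interval are hypothetical helper names. Clipping to [0, 1] is my own assumption for reading the score as a probability-like value, since the labels are 0 or 1 but a regression head is unbounded.

```python
def predict_scores(model_dir, texts, max_length=2048):
    """Return one regression score per input text (a sketch, assuming num_labels == 1)."""
    # Imported lazily so the helpers can be defined without loading the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (batch_size, 1) for a single output
    return logits.squeeze(-1).tolist()


def clip_to_unit_interval(scores):
    """Clamp raw regression outputs to [0, 1] (an assumption, not part of the model)."""
    return [min(1.0, max(0.0, s)) for s in scores]
```

With a regression head there is no argmax or class lookup: the single logit per example is the prediction.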
