Sequence Classification -- Fine Tune?


I am new to Transformers/NLP.

I am trying to use Transformers for text classification. If I am not classifying into one of the pre-made GLUE benchmark tasks (and am using my own classes and texts), do I have to “fine-tune” the model? If I have 35k texts and 2 labels (imbalanced: 98% vs. 2%), can I just use AutoModelForSequenceClassification to put a softmax head on the end of the transformer and train only that head? Or do I fine-tune the whole thing, using this tutorial?

Thanks! I am excited about better understanding the field and more effectively using the library.

Hi @AlanFeder,

I’m not familiar with imbalanced datasets, but if I were you, I would try the script in examples/text-classification/.
In this example, instead of specifying one of the GLUE task names, we can pass our own train_file, validation_file, and test_file.

python run_glue.py \
  --model_name_or_path bert-base-cased \
  --train_file train_file_name \
  --validation_file validation_file_name \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /path/to/output/

You can use CSV or JSON files.
The script loads them with datasets.load_dataset, so the datasets.load_dataset documentation for local files may help you.

I remember that columns in one of the following formats were easier to handle, but even if the column names or order differ, it should work if the code is adjusted appropriately.
(Please see the transformers repository at commit 5ed5a54684ef059fa4c9710858b8e03c61295914 on GitHub for the details.)

For sentence-pair tasks:

sentence1, sentence2, label

For single-sentence tasks:

sentence, label
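As an illustration of the single-sentence format (the file name and rows here are made-up examples, not from the script itself), such a file can be written with the standard library:

```python
import csv

# Tiny made-up train file in the "sentence, label" format
# mentioned above.
rows = [
    {"sentence": "The delivery arrived on time.", "label": 0},
    {"sentence": "This is clearly a fraudulent charge.", "label": 1},
]

with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sentence", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the column layout.
with open("train.csv", newline="") as f:
    loaded = list(csv.DictReader(f))
print(loaded[0])
```

A file like this is what you would then pass as --train_file.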

Again, I’m not familiar with imbalanced datasets, so I don’t think I’ve answered your question “do I have to fine-tune?” Sorry.

I am just one of the users who is learning about this library, so I hope you can get more useful advice from someone who knows more than I do.


Hi @AlanFeder, in addition to @yusukemori’s useful advice, you might find it instructive to start by working through the Hugging Face tutorial on fine-tuning a pretrained model for text classification.

The short answer to your question is that you generally do have to fine-tune one of the pretrained language models like distilbert-base-uncased using AutoModelForSequenceClassification.

An alternative, but less performant, approach is to just use the model’s last hidden states as input features to a classifier like logistic regression. Jay Alammar has a nice example of this in his post “A Visual Guide to Using BERT for the First Time” (along with tons of great explanations about Transformers).
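A minimal sketch of that frozen-features approach: in practice the feature matrix would be the last hidden state of the [CLS] token from a model like distilbert-base-uncased (768-dimensional for the base models); here random vectors stand in for those embeddings so the snippet runs without transformers installed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the [CLS] embeddings you would extract from a
# frozen transformer (one 768-dim vector per text).
rng = np.random.default_rng(0)
n_samples, hidden_size = 200, 768
X = rng.normal(size=(n_samples, hidden_size))
y = rng.integers(0, 2, size=n_samples)

# Train a simple classifier on top of the fixed features.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
preds = clf.predict(X_te)
print(preds.shape)
```

The transformer is never updated here, which is why this tends to underperform full fine-tuning, but it is fast and needs no GPU for training the classifier.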

To tackle the imbalance, you could try upsampling the minority class or downsampling the majority class, or, failing that, weight the classes directly in the loss function by subclassing the Trainer (see the Trainer documentation on overriding compute_loss).
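For the class-weighting route, inverse-frequency weights for the 98%/2% split mentioned above can be computed like this. The 35k/98%/2% numbers come from the question; the total / (n_classes * count) scheme is one common convention, not the only one:

```python
from collections import Counter

# Hypothetical label counts matching 35k texts with the
# 98% / 2% split described in the question.
labels = [0] * 34300 + [1] * 700

counts = Counter(labels)
n_classes = len(counts)
total = len(labels)

# weight_c = total / (n_classes * count_c): the rare class
# gets a proportionally larger weight.
weights = {c: total / (n_classes * n) for c, n in counts.items()}
print(weights)  # class 1 ends up weighted ~49x class 0
```

Weights like these could then be passed to a weighted cross-entropy loss (e.g. torch.nn.CrossEntropyLoss(weight=...)) inside a Trainer subclass’s compute_loss.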



Thanks @lewtun and @yusukemori for your help!

I tried the method you mentioned from Jay Alammar’s post. It did work, but performance was weak (it was beaten by my “benchmark” of tf-idf + logistic regression), so I will try fine-tuning with the Trainer. I may downsample the majority class, at least at first, to make training run faster.

Thanks again!