Fine-Tune for MultiClass or MultiLabel-MultiClass


I want to build a:

  1. MultiClass Label
    (eg: Sentiment with VeryPositiv, Positiv, No_Opinion, Mixed_Opinion, Negativ, VeryNegativ)

  2. and a MultiLabel-MultiClass model to detect 10 topics in phrases
    (eg: Science, Business, Religion …etc)

and I am not sure where to find the best model for these types of tasks?

I understand this refers to the Sequence Classification Task. So, I could search for a model tagged with that task on your model repository site - but not all models are tagged like that and the transformers API seems to provide much more task applications beyond the original training.

I found with the code below that I can have a model that supports originally 5 labels but load it into a ConvBertForSequenceClassification model to support, for example 25 labels. Would this (plus softmax or sigmoid and fine-tuning) be the correct way to pick up an existing model and implement 1. or 2. or is there a different more effective way to choose a model and fine tune it?

Thanks dirk

from transformers import pipeline
nlp = pipeline("sentiment-analysis", 'bert-base-multilingual-uncased-sentiment')
result = nlp("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
result = nlp("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

#label: 1 star, with score: 0.6346
#label: 5 stars, with score: 0.8547

from transformers import ConvBertForSequenceClassification, ConvBertTokenizer

convBertModel = ConvBertForSequenceClassification.from_pretrained('bert-base-multilingual-uncased-sentiment', num_labels=25)
convBerttokenizer = ConvBertTokenizer.from_pretrained('bert-base-multilingual-uncased-sentiment')

print(f"num_labels: {convBertModel.num_labels}")
print(f"classifier: {convBertModel.classifier}")

# num_labels: 25
# classifier: ConvBertClassificationHead(
#   (dense): Linear(in_features=768, out_features=768, bias=True)
#   (dropout): Dropout(p=0.1, inplace=False)
#   (out_proj): Linear(in_features=768, out_features=25, bias=True)
# )
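
The softmax-versus-sigmoid choice raised in the question can be illustrated in plain Python, without any model. This is only a sketch: the logits are made up, and the 0.5 threshold is an assumption rather than anything prescribed by the library. Softmax makes the classes compete (case 1, one label per text), while sigmoid scores each class on its own (case 2, several labels per text).

```python
import math

def softmax(logits):
    """Turn logits into probabilities that sum to 1 (one label per text)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Score every logit independently in (0, 1) (any number of labels per text)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Hypothetical logits for three classes, made up for illustration.
logits = [2.0, 1.0, -1.0]

# Multi-class: keep the single most probable class.
probs = softmax(logits)
predicted_class = probs.index(max(probs))

# Multi-label: keep every class whose own score passes the (assumed) threshold.
scores = sigmoid(logits)
predicted_labels = [i for i, s in enumerate(scores) if s >= 0.5]

print(predicted_class)   # index of the winning class
print(predicted_labels)  # indices of all classes above the threshold
```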

Hi @dikster99,

The way I usually search for models on the Hub is by selecting the task in the sidebar, followed by applying a filter on the target dataset (or querying with the search bar if I know the exact name). In both your cases, you’re interested in the Text Classification tag, which is a specific example of sequence classification.

However, this assumes that someone has already fine-tuned a model that satisfies your needs. If not, there are two main options:

  • If you have your own labelled dataset, fine-tune a pretrained language model like distilbert-base-uncased (a faster variant of BERT). You can find a nice example for text classification here and see here for the multi-label case. In general this is similar to your second example with ConvBertForSequenceClassification and you were correct to specify num_labels in the from_pretrained function :slight_smile:
  • If you have no labelled dataset, then you could try out one of the Zero-Shot Classification models (e.g. here). See this blog post for an explanation on how zero-shot works.

Hope that helps!

PS. the Pipeline classes are typically used for generating predictions from a fine-tuned model, so your example with bert-base-multilingual-uncased-sentiment wouldn’t work because that model was trained on 5 labels and does not know about the 25 labels you are interested in.


Hi Lewtun,

thanks for answering so quickly and precisely. I have labelled data, so I am looking at the first option. It’s clear to me that I cannot use the pipeline statement since the model does not satisfy my needs out of the box. I was just using it to have a reference to something concrete. I guess I am just confused by the number of different models supported: some let you specify num_labels and some will accept exactly one number (eg 5)…

…but I understand now that I can specify a different num_labels value for some models and this will result in:

removing the head used to pretrain the model on a masked language modeling objective and
replacing it with a new head, as described in your linked tutorial (Fine Tuning section).

So, to get me started with the MultiClass task where I have 1 label but 6 values (VeryPositiv, Positiv, No_Opinion, Mixed_Opinion, Negativ, VeryNegativ) - I could actually use, for example:

  • a model trained for a True/False sentiment,
  • specify num_labels=1 and train it with my labels containing 6 values
    and that should work if my labels are sensible.

Did I get that right?

I guess, I’ll just get started with the MultiClass and see how I fare :slight_smile:

Thanks a lot for helping me, Dirk

PS.: I am working in a cloud environment without Internet. So, I had to resolve the question of how to download a model and use it locally, which I had a hard time finding in the documentation. So, I constructed 2 notebooks that showcase:

  1. How a model can be downloaded
  2. How a local model can be used

I actually needed about 2 days to figure this out just because I was missing the git lfs part (it generates an error message shown in the notebook) and the error message from the Transformers library did not recognise that the partially downloaded files were actually wrong :slight_smile: …and I probably made other silly mistakes too … So, I am thinking that others might have the same problem and it might be good to include something as specific as the 2 notebooks in your documentation, to be clearer about what can be done with Transformers and how :slight_smile: What do you think?

Hi @dikster99,

So, to get me started with the MultiClass task where I have 1 label but 6 values (VeryPositiv, Positiv, No_Opinion, Mixed_Opinion, Negativ, VeryNegativ) - I could actually use, for example:

  • a model trained for a True/False sentiment,
  • specify num_labels=1 and train it with my labels containing 6 values
    and that should work if my labels are sensible.

Did I get that right?

I think in this case you actually have 6 labels, one for each independent category. For example, such a dataset might take the following form:

| Text | Label |
| --- | --- |
| I really :heart: Transformers! | VeryPositiv |
| The pandemic is a drag :crying_cat_face: | Negativ |
| I went for a walk today | No_Opinion |

so you have one label column, but each row can take one of six values.
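
A minimal sketch of what "one label column, six possible values" means once the strings are turned into integer ids, which is the form a classification head trains on. The ordering of the six names and the example rows are assumptions for illustration; any fixed order works as long as it stays the same for training and prediction.

```python
# The six sentiment values from this thread, in an arbitrary but fixed order.
labels = ["VeryPositiv", "Positiv", "No_Opinion",
          "Mixed_Opinion", "Negativ", "VeryNegativ"]

# Two lookup tables: name -> integer id, and integer id -> name.
label2id = {name: idx for idx, name in enumerate(labels)}
id2label = {idx: name for name, idx in label2id.items()}

# Hypothetical rows paraphrasing the example table above.
rows = [
    ("I really love Transformers!", "VeryPositiv"),
    ("The pandemic is a drag", "Negativ"),
    ("I went for a walk today", "No_Opinion"),
]

# Each row still has exactly one label, now stored as a single integer.
encoded = [(text, label2id[label]) for text, label in rows]

# This is the value to pass as num_labels.
num_labels = len(labels)
print(num_labels, encoded)
```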

To get started, my suggestion would be to load up one of the pretrained language models (e.g. distilbert-base-uncased), specify num_labels=6, and then tokenize the dataset / build the Trainer as done in Sylvain’s tutorial, e.g. something like

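from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# num_labels=6 attaches a fresh classification head with six outputs
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=6).to("cuda")

def tokenize_text(batch):
    # Replace "text" with whatever column name has your text inputs
    return tokenizer(batch["text"], truncation=True)

# Assumes `dataset` is your data loaded with the datasets library
dataset_enc = dataset.map(tokenize_text, batched=True)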

Regarding your experience with downloading the models locally, did you know about the “Use in transformers” button from the model hub?

If not, perhaps it should be made clearer in the documentation but for that I suggest opening an issue or PR on GitHub so the maintainers can provide you feedback :slight_smile:

Hi Lewis,

I have to be honest - the GLUE example is a bit out of my league right now as I just started to do some work on TensorFlow… :frowning: …but the good news is that I’ve found a Sentiment Classification sample that seems to have a much lower barrier to getting it to work and making sense of it. :smile: So, I took this and developed a MultiClass classification from it (essentially setting num_labels as discussed and using a public demo dataset as you rightly suggested).

I see that the above solutions work with a yield mechanism, so I am wondering whether this implementation will scale well in terms of memory usage for larger files with more labels. What do you think?

Otherwise, I am thinking that I can develop a MultiLabel-MultiClass Classification from the MultiClass classification by providing a Pandas label column with a list of values (eg ‘[0,1,0,1]’) and setting the num_labels to the length of the array in the label column. And replacing the Softmax with a Sigmoid function to yield the correct result. Would you say this is about right or am I missing something again :frowning: ?
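
The label layout described here is usually called a multi-hot vector: one position per topic, set to 1 when the topic applies. A small sketch in plain Python follows; the three topic names are taken from the first post (which mentions 10 topics in total), and the helper name `multi_hot` is made up for this example. The values are stored as floats because a sigmoid-based loss compares them with continuous scores.

```python
# Three of the topics mentioned in the first post; a real setup would list all of them.
topics = ["Science", "Business", "Religion"]
topic2id = {name: idx for idx, name in enumerate(topics)}

def multi_hot(active_topics, topic2id):
    """Return a vector with 1.0 at the position of every topic that applies."""
    vector = [0.0] * len(topic2id)
    for topic in active_topics:
        vector[topic2id[topic]] = 1.0
    return vector

# A text about science and religion gets two active positions.
example = multi_hot(["Science", "Religion"], topic2id)

# num_labels is the length of the vector, not the number of active topics.
num_labels = len(topics)
print(example, num_labels)
```

Each example then carries a label of shape (num_labels,) instead of a single scalar, so whatever feeds the data to the model has to be told to expect a vector per example.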


mhh, it looks like my assumptions about the MultiLabel MultiClass implementation with the Transformers library are missing a detail :cry: I tried to change the code as described in the last post and now I am receiving an exception from a stack trace 9 frames deep:

InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  ValueError: `generator` yielded an element of shape (6,) where an element of shape () was expected.
Traceback (most recent call last):


(1) Invalid argument:  ValueError: `generator` yielded an element of shape (6,) where an element of shape () was expected.

I get that the shape is not as expected but is that:

  1. because I am missing a configuration parameter or
  2. because I am using the labels in a wrong way? and if it is the 2nd, how am I supposed to tell the model that multiple labels can be applicable for a single text?

Does anyone have a hint towards what I am doing wrong here?

Hi @lewtun,

I have looked at this MultiLabel sample that you previously recommended and found it throws the exception in 1. (see below) - I understand from searching the net that there was a breaking change in Transformers Version 3.0 and was able to resolve this problem by setting truncation=True in the tokenizer section of the CustomDataset class.

But then the code still throws the exception in 2. and this is where I am lost without a fix - would you have any idea how to get this code to work with a current version of transformers (4.3.3)?

  1. Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to ‘longest_first’ truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.

  2. Exception: TypeError: dropout(): argument ‘input’ (position 1) must be Tensor, not str

Hi @dikster99, I had a closer look at the multi-label example I linked to and see that it’s more complicated than it needs to be because:

  • transformers now has a Trainer class that dramatically simplifies the training / evaluation loops.
  • the datasets library is a much better way to prepare the data and works great with the Trainer

To implement multi-label classification, the main thing you need to do is override the forward method of BertForSequenceClassification to compute the loss with a sigmoid instead of softmax applied to the logits. In PyTorch it looks something like

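import torch
from transformers import BertForSequenceClassification
from transformers.modeling_outputs import SequenceClassifierOutput

class BertForMultilabelSequenceClassification(BertForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)

    # Argument list reconstructed from the parent class signature in transformers 4.x
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, inputs_embeds=None, labels=None,
                output_attentions=None, output_hidden_states=None, return_dict=None):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask,
                            inputs_embeds=inputs_embeds,
                            output_attentions=output_attentions,
                            output_hidden_states=output_hidden_states,
                            return_dict=return_dict)

        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            # Sigmoid + binary cross-entropy per label instead of softmax cross-entropy
            loss_fct = torch.nn.BCEWithLogitsLoss()
            loss = loss_fct(logits.view(-1, self.num_labels),
                            labels.float().view(-1, self.num_labels))

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(loss=loss,
                                        logits=logits,
                                        hidden_states=outputs.hidden_states,
                                        attentions=outputs.attentions)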

where the only thing that I’ve really changed are these two lines

loss_fct = torch.nn.BCEWithLogitsLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.float().view(-1, self.num_labels))
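
To make those two lines less of a black box, here is a rough plain-Python version of what a sigmoid-plus-binary-cross-entropy loss computes, assuming the default behaviour of averaging over every (example, label) pair. It is only an illustration of the arithmetic with made-up numbers, not the PyTorch implementation; the function name `bce_with_logits` is chosen for this sketch.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (example, label) pair.

    logits and targets are lists of rows with one entry per label.
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for z, y in zip(logit_row, target_row):
            # Numerically stable form of
            # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

# Made-up batch: two texts, three labels each.
logits = [[2.0, -1.0, 0.0], [-3.0, 1.5, 0.5]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
print(bce_with_logits(logits, targets))
```

Because every label contributes its own term, several labels can be "on" for the same text, which is exactly what softmax cross-entropy cannot express.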

You can probably adapt the TensorFlow code in a similar fashion (I haven’t used TF in years so can’t be much help there :slight_smile:).

There are some other things needed (e.g. the metrics), so I put together a hacky notebook here that you can use as a template to get started.
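
On the metrics point: plain accuracy is awkward for multi-label outputs, so one common choice is a micro-averaged F1 over all (example, label) decisions. The sketch below is not taken from the notebook; it is one possible metric in plain Python, with an assumed threshold of 0.5 and a made-up function name.

```python
import math

def _sigmoid(z):
    """Sigmoid that avoids overflow for large negative inputs."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def multilabel_micro_f1(logits, targets, threshold=0.5):
    """Micro-averaged F1 over every (example, label) decision."""
    tp = fp = fn = 0
    for logit_row, target_row in zip(logits, targets):
        for z, y in zip(logit_row, target_row):
            predicted = _sigmoid(z) >= threshold
            actual = y >= 0.5
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Made-up logits and multi-hot targets for two texts and three labels.
logits = [[3.0, -2.0, 0.5], [-1.0, 2.0, -3.0]]
targets = [[1, 0, 0], [0, 1, 1]]
f1 = multilabel_micro_f1(logits, targets)
print(round(f1, 4))
```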


Hi again @dikster99, thanks to a tip from Sylvain Gugger, I realised that there’s a much simpler way to implement multi-label classification: just override the compute_loss function of the Trainer!

Here’s an example in PyTorch:

import torch
from transformers import Trainer

class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Take the labels out so the model does not compute its default loss
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Sigmoid + binary cross-entropy, applied independently to each label
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                        labels.float().view(-1, self.model.config.num_labels))
        return (loss, outputs) if return_outputs else loss

and I’ve updated my Colab notebook to reflect the change. Hope that helps!

PS you will need to install transformers from the master branch for this to work, i.e. pip install git+


Hi @lewtun ,

thanks for your cool input :slight_smile: :grin:

unfortunately, I am not able to use Datasets because it’s not installed in my target environment - I usually use Pandas or Spark there - is it possible to substitute Datasets with Pandas or Spark based on direct access to trains.csv and test.csv in your notebook?

I also do not have an Internet Connection in this target environment (for security reasons) which is why I cannot use a direct download link from GitHub or anywhere else :frowning: We have transformers v4.2.1 installed and so I am bound to this version as well.

Would you be able to adjust your sample notebook to these requirements:

  1. use Pandas or Spark for data retrieval
  2. Access data from file system rather than retrieval through Internet connection as I want to replace this later with my own data in our private environment
  3. Use transformers version v4.2.1

Sorry to be such a pain about it… :frowning: :de: :no_pedestrians: :no_bicycles: :no_smoking: :do_not_litter: …but I hope it’s easy to meet these requirements as I imagine many others are working in similarly locked down environments …

Thanx Dirk

PS.: I have found a notebook in the transformers community (after unsuccessfully testing 3 others) which seems to meet my requirements (but is much more complex than your approach) and appears to work with transformers 4.2.1 <=> 4.3.3:

Hi @dikster99, sure here’s an example that meets your constraints (I too know the joys of working behind a firewall :grinning_face_with_smiling_eyes: ):

You’ll just have to point data_dir to the location of your files on disk and adapt the code a bit to load both the train and test files (I just worked with the training set for simplicity). You can find more details about creating custom datasets here: Fine-tuning with custom datasets — transformers 4.3.0 documentation

If you can’t access the HuggingFace Hub through your firewall, you’ll need to clone the model repo (as mentioned in a previous post) and upload the model files manually to your env.
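
Since the git lfs step caused trouble earlier in this thread, here is a small sanity check that could be run on a cloned model folder before copying it behind the firewall. It relies on the fact that a file which was not fetched through Git LFS is left as a tiny text stub starting with a fixed version line. The function name and the way it is called are made up for this sketch.

```python
from pathlib import Path

# First bytes of a Git LFS pointer stub (the real weights were never downloaded).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def find_lfs_pointers(model_dir):
    """Return the names of files in model_dir that are still LFS pointer stubs."""
    stubs = []
    for path in sorted(Path(model_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path.name)
    return stubs

# Hypothetical usage on a cloned repository:
# print(find_lfs_pointers("path/to/cloned-model"))
```

If the list is not empty, running `git lfs pull` inside the clone should replace the stubs with the real files; an empty list suggests the folder is ready to be uploaded and passed to from_pretrained as a local path.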

Good luck!


Hi Lewis,

your solution works great and it’s really impressive how easy it is to get good prediction accuracy :smile:

I found an extremely good intro to transformers and things are adding up for me now - so, I can create a TF version based on your code.

So, thanx a bunch for getting me started :heart: :star: :rainbow: :tada: :high_brightness:



mhhh, I guess I was too positive too early :frowning:

The problem I am having now is that I can train a model and it really seems to produce good accuracy/loss results, but the prediction always yields the same class :frowning:

The CoLab notebook is here using the 20 newsgroups dataset as a sample.

Am I using the wrong way to train or is the prediction not correct?

I’ve tried this with different models and different datasets but the result always seems to be the same - the model just always predicts 1 class no matter what I try to enter as sample text :frowning:

When you say “same class”, do you mean that you’re always getting something like for different inputs or do you mean you are getting one class (multiclass) when you expect multiple (multilabel)?

One thing I wonder is whether you need to pass the label2Index and index2label mappings to your model when you initialise it, i.e.

bert = TFAutoModel.from_pretrained(tranformersPreTrainedModelName, label2id=label2Index, id2label=index2label)

That way you ensure the model’s logits line up with the mappings you’re using to generate the predictions at the end of the notebook.

When I say ‘same class’ I mean that it always predicts no matter what text I enter - even on a sports or religions text :frowning: even though this model should generate a MultiClass classification result…

Your suggestion with the

bert = TFAutoModel.from_pretrained(tranformersPreTrainedModelName, label2id=label2Index, id2label=index2label)

seems to change the situation in the way that the model predicts not just one class but about 3 on the 4 sample sentences. But it is still almost always wrong on the actual target even though I trained the model with a seq_len=50 over 32 Epochs with these results:

loss: 0.0456 - accuracy: 0.9882 - val_loss: 0.1228 - val_accuracy: 0.9680

I am confused because I don’t understand why I have to supply the parameters since I am using the model without its built-in heads? I have used the exact same model configuration with a classic TensorFlow/Keras GloVe model and reached a worse accuracy but had better actual predictions - so, there still seems to be something strange here because the measured accuracy just does not add up with the actual prediction result :frowning:


I think I got it to work now - thanks to your tip about the label - index mapping. I don’t really understand why that’s needed but it seems to do the trick now as expected. What do you think?


nice to hear you got it working @dikster99! i had a quick look at your code and it looks good :slight_smile:

the reason the label → id and id → label mappings are needed is because the default behaviour is to create a dummy mapping based on the number of labels: transformers/ at fa35cda91e9a1929f9cebeb973801709ba31dd4b · huggingface/transformers · GitHub

so in your case, the accuracy, loss etc were all fine because the model was using an internally-consistent definition for the labels, it’s just that the meaning of those labels did not match the ones in the data.
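
a tiny plain-Python illustration of that mismatch, under the assumption that the dummy mapping has the form LABEL_0, LABEL_1, … (the topic names and logits are made up, and the helper names are only for this sketch):

```python
def default_id2label(num_labels):
    """Placeholder names used when no mapping is supplied (assumed form)."""
    return {i: f"LABEL_{i}" for i in range(num_labels)}

def decode(logits, id2label):
    """Name of the highest-scoring class under a given id -> label mapping."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]

# Hypothetical mapping from the data and made-up logits for one text.
index2label = {0: "Science", 1: "Business", 2: "Religion"}
logits = [0.1, 2.3, -0.7]

# Same winning index, but only the second result means something to a reader.
print(decode(logits, default_id2label(3)))
print(decode(logits, index2label))
```

the winning index is identical in both calls, so training metrics look fine either way; only the name attached to that index changes.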

This is awful, or at least it seems to be less than stable :frowning:

So, I tried to apply the code that we already had working for English text, only to find out that it does not work for German. Because there are not many public German datasets around, I created my own.

I then published 3 notebooks on GitHub to show the dilemma:

  1. 66_Multi_Label_German_text_classification_in_TensorFlow_Keras.ipynb
    Shows a Multilabel approach via a classic Keras/TF implementation to show that the German dataset can yield useful results (the results are not great because I did not clean the data extracted originally somewhere else but this shows the dataset can be used for a classification task)

  2. 66_Transformer_4_Language_Classification_MultiLabel_DistilBert.ipynb
    Shows a Multilabel approach via Transformers DistilBert modeling using the English Toxicity dataset collection as we previously discussed.
    This model is working as expected and does indeed return useful results.

  3. 66_Transformer_4_Language_Classification_MultiLabel_DistilBert_German.ipynb
    This notebook uses the exact same transformers code approach as we’ve used in 2) with the difference that we try to consume German data here. The accuracy shown during the training phase looks useful but testing at the end of the notebook shows that the actual predictions are less than useful (no single prediction is correct and the model always predicts very similar output no matter what the input is)

I wonder if this is a known bug or whether this should be an issue for the HuggingFace repository?


Hi @lewtun,

does that mean the multi-label problem can be solved directly in the Trainer, without subclassing BertForSequenceClassification as in the following?

class BertForMultilabelSequenceClassification(BertForSequenceClassification):


hey @Loganathan, it’s actually now possible to do multi-label classification for some models without needing to create your own Trainer subclass :tada:

for example, with BERT you can specify the problem_type parameter in the model config as follows:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", problem_type="multi_label_classification")

this ensures that the loss function used in the forward pass is BCEWithLogitsLoss() which is suitable for multi-label tasks. you should then be able to use this model in a standard Trainer :slight_smile:

if your model is not in one of the types listed under problem_type in the config here then you can either subclass the Trainer as i discussed earlier in the thread (using a standard ModelNameForSequenceClassification class) or subclass the model and override the loss calculation in the forward pass
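
as a rough mental model of what problem_type does, here is a plain-Python sketch of the selection logic as i understand it for the models that support it. it is not the library code: the function name and the boolean flag are made up, and it is worth checking against your installed version.

```python
def pick_loss(problem_type, num_labels, labels_are_integers):
    """Return the name of the loss that matches the configured problem type."""
    if problem_type is None:
        # With no explicit setting, the type is guessed from the labels.
        if num_labels == 1:
            problem_type = "regression"
        elif labels_are_integers:
            problem_type = "single_label_classification"
        else:
            problem_type = "multi_label_classification"
    losses = {
        "regression": "MSELoss",
        "single_label_classification": "CrossEntropyLoss",
        "multi_label_classification": "BCEWithLogitsLoss",
    }
    return losses[problem_type]

# Six sentiment classes stored as integer ids: ordinary multi-class.
print(pick_loss(None, 6, labels_are_integers=True))
# Multi-hot float vectors, or the explicit setting from the snippet above: multi-label.
print(pick_loss("multi_label_classification", 6, labels_are_integers=False))
```

this also shows why num_labels=1 is not the right setting for six sentiment values: a single output is treated as a regression target, not as one column with six classes.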