Most efficient multi-label classifier?

qedgary · August 17, 2021, 3:45pm

Background

I’m trying to train a model in Tensorflow to classify text according to a fixed set of 5 labels. For example, let’s say I feed my model the following text:

“my advice is that you go ahead with your plans to learn Python, because its syntax is easy for beginners. It’s also great for snake lovers like me!”

After sniffing the text, the model would, ideally, report back how much the text matches my pre-defined labels:

       Label             Prediction
--------------------     ----------
programming_advice          0.99
advice_for_beginners        0.91
cooking_advice              0.11
health_advice               0.10
not_advice                  0.01

My question

What is the most efficient way to build such a classifier? I’ve seen several options to do this, but I’m not sure which one would be best:

1. Fine-tune five different binary classifiers, since there are five labels… but this would take forever to train, so I assume there must be a better way.
2. Make a model with a transformer only, and train it.
   - Seen in this forum question
3. Make a model with a transformer plus my own Dense layers, and train it.
   - Seen in this sample notebook, which was linked in the transformers documentation.
4. Make a model with a transformer plus my own Dense layers, but freeze the transformer as-is and train only the Dense layers.
   - Freezing is a common practice with pre-trained computer vision models; I don’t know whether it’s good practice for NLP too.

I would be grateful for any suggestions on which of 1-4 works best. I’m still rather new around here, but the Hugging Face community is extremely welcoming and helpful, and I appreciate being here! A big thanks to anybody who can give me some pointers.

marlon89 · August 23, 2021, 2:54pm

I am facing the same problem, and as no one has replied yet, I wanted to ask whether you have any updates or new thoughts on this. Cheers

nielsr · August 23, 2021, 6:51pm

Hi,

Option 2 is indeed the best. To train a multi-label classifier, you can use an xxxForSequenceClassification model (which is a Transformer encoder with a linear layer on top), and set the problem_type attribute of the configuration to multi_label_classification. For example, if you want to use BERT, you can do it as follows:

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", problem_type="multi_label_classification")

As you can see in the code, it will use the BCE (binary cross-entropy) loss.

Note that if you have more than 2 labels, you also need to specify num_labels=... when calling the .from_pretrained() method (the default is 2).
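To make the loss concrete, here is a minimal sketch in plain Python of what binary cross-entropy on logits computes for a single example: every label gets its own sigmoid and its own binary loss, and the per-label losses are averaged. The function names and the example numbers are made up for illustration; in the library this work is done by PyTorch's BCEWithLogitsLoss, not by code like this.

```python
import math


def sigmoid(x: float) -> float:
    """Squash one logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over the labels of one example."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)  # each label is scored independently of the others
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Hypothetical logits for the 5 labels, and the multi-hot target vector.
logits = [4.0, 2.0, -2.0, -2.0, -4.0]
targets = [1.0, 1.0, 0.0, 0.0, 0.0]

probs = [sigmoid(z) for z in logits]
loss = bce_with_logits(logits, targets)

print([round(p, 2) for p in probs])
print(round(loss, 4))
```

Because each label has its own sigmoid, the probabilities do not have to sum to 1, which is what lets several labels be "on" at once. A softmax with cross-entropy (the single-label setup) would force the labels to compete instead.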

vettukal · September 1, 2022, 11:18pm

Below is a code sample that I managed to get working using the multi_label_classification problem type:



import torch
from torch.utils.data.dataset import Dataset

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Example data. 
# In reality, the strings are usually longer and there are 11 possible classes
texts = [
    "This is the first sentence.",
    "This is the second sentence.",
    "This is another sentence.",
    "Finally, the last sentence.",
]

labels = [
    [0.99, 0.91, 0.11, 0.10, 0.01],
    [0.89, 0.51, 0.01, 0.10, 0.01],
    [0.39, 0.21, 0.11, 0.10, 0.11],
    [0.29, 0.91, 0.51, 0.20, 0.51],
]


train_texts = texts[:2]
train_labels = labels[:2]

eval_texts = texts[2:]
eval_labels = labels[2:]

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

train_encodings = tokenizer(train_texts, padding="max_length", truncation=True, max_length=512)
eval_encodings = tokenizer(eval_texts, padding="max_length", truncation=True, max_length=512)


class TextClassifierDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # The BCE loss expects float targets, one value per label
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

train_dataset = TextClassifierDataset(train_encodings, train_labels)
eval_dataset = TextClassifierDataset(eval_encodings, eval_labels)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", 
    problem_type="multi_label_classification",
    num_labels=5
)

training_arguments = TrainingArguments(
    output_dir=".",
    evaluation_strategy="epoch",  # newer transformers releases name this eval_strategy
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
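
After training, the model returns raw logits rather than scores between 0 and 1. Here is a minimal sketch of turning one row of logits into a per-label table like the one in the original question. It is written in plain Python so it runs without a model; the logits are made up, the helper name scores_from_logits is hypothetical, and the 0.5 threshold is an assumption that you would normally tune on validation data.

```python
import math

# Label names taken from the question at the top of the thread.
LABELS = [
    "programming_advice",
    "advice_for_beginners",
    "cooking_advice",
    "health_advice",
    "not_advice",
]


def scores_from_logits(logits, threshold=0.5):
    """Map one row of raw logits to (label, score, is_predicted) tuples."""
    rows = []
    for name, z in zip(LABELS, logits):
        score = 1.0 / (1.0 + math.exp(-z))  # independent sigmoid per label
        rows.append((name, score, score >= threshold))
    return rows


# Made-up logits standing in for one row of the model's output logits.
example_logits = [4.6, 2.3, -2.1, -2.2, -4.6]
results = scores_from_logits(example_logits)

for name, score, is_on in results:
    print(f"{name:<22} {score:.2f}  {'yes' if is_on else 'no'}")
```

With real model outputs, the same step is usually done on the whole batch at once, for example with torch.sigmoid applied to the logits tensor.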
