Fine-Tune for MultiClass or MultiLabel-MultiClass

Hi @lewtun,

I have 200 classes. Which model should I choose to get good accuracy, and how should I handle an imbalanced dataset?

Any ideas here? I'm struggling with that at the moment.

Hi @lewtun, I’m getting an error when trying to load the dataset:

ValueError: Couldn't cast
id: string
comment_text: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 490
to
{'id': Value(dtype='string', id=None), 'comment_text': Value(dtype='string', id=None), 'toxic': Value(dtype='int64', id=None), 'severe_toxic': Value(dtype='int64', id=None), 'obscene': Value(dtype='int64', id=None), 'threat': Value(dtype='int64', id=None), 'insult': Value(dtype='int64', id=None), 'identity_hate': Value(dtype='int64', id=None)}
because column names don't match

Hey @jacobajit, are you referring to the go_emotions dataset? I’m able to load it using datasets v1.18.2 with

from datasets import load_dataset

dataset = load_dataset('go_emotions', 'simplified')

Perhaps you can upgrade your datasets version and try again?

@lewtun No, this is in reference to the toxic comments dataset that’s to be loaded in this cell of the multi-label example notebook above (transformers_multilabel-text-classification-with-problem-type.ipynb - Google Drive):

from pathlib import Path
from datasets import load_dataset

# path to train.csv, test.csv and test_labels.csv
data_dir = Path("/content/gdrive/MyDrive/Colab Notebooks/data")
ds = (load_dataset("jigsaw_toxicity_pred", data_dir=data_dir, split='train')
        .train_test_split(train_size=800, test_size=200))
ds

I’ve placed train.csv, test.csv, and test_labels.csv in the data_dir, but I’m still getting this error.

I’ve just downloaded the dataset from Kaggle, unzipped the files and created a directory with the following structure:

jigsaw-toxic-comment-classification-challenge
├── test.csv
├── test_labels.csv
└── train.csv

I was then able to load it using your command without error. It seems like there might be a problem with either the raw data you’re working with or (possibly) the pandas version that’s used in the underlying dataset loading script. I’m using pandas v1.2.5 so maybe try upgrading and see if that works?
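
For what it's worth, you can double-check the versions in your environment with something like this (a minimal sketch; the commented pip line assumes you're running in a Colab/Jupyter cell):

import pandas as pd
import datasets

# Versions in the current environment
print(pd.__version__)        # loading worked for me with pandas 1.2.5
print(datasets.__version__)  # and datasets 1.18.2

# To upgrade inside Colab, run this in a cell and then restart the runtime:
# !pip install -U pandas datasets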

I’m working on multi-label classification using Hugging Face Transformers. I added the problem_type argument in the model initialization exactly as you did, but it gave me this error:

TypeError: __init__() got an unexpected keyword argument 'problem_type'

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        return_dict=True,
        num_labels=7,
        problem_type="multi_label_classification",
    )

Could you please help me solve this issue? My complete code, which includes preparing the dataset for training, is here.

What modifications do I need to make?

Thank you for your help in advance.

Hi,

Check out this notebook for fine-tuning any model for multi-label classification: Transformers-Tutorials/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub.
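
The core recipe there, roughly, is to set problem_type="multi_label_classification" and feed the model multi-hot float label vectors, which makes it use BCEWithLogitsLoss internally (a minimal sketch, assuming a recent transformers version; bert-base-uncased and the label names are just placeholders):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

label_names = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_names),
    problem_type="multi_label_classification",
)

# Labels are multi-hot *float* vectors, one slot per label
labels = torch.tensor([[1.0, 0.0, 1.0, 0.0, 0.0, 0.0]])

encoding = tokenizer("an example comment", return_tensors="pt")
outputs = model(**encoding, labels=labels)
print(outputs.loss)  # BCEWithLogitsLoss is applied under the hood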


Most of the examples shared deal with a very small label set, where generating one-hot encoded vectors is not an issue. But how can we train a model with hundreds or thousands of labels? Converting the labels to one-hot encoded vectors is not an option.

Hi @nielsr @sgugger, I am stuck on a somewhat different kind of problem. I have read the thread above, which was a great read, and I also referred to the notebook you shared. Please see my problem below.

I have three classes (0, 1, 2) and 100 labels per instance.
Label format: <0, 2, 0, 1, 0, 2, …, 100th label>

So how do I model this? Would exactly the same notebook you shared work here? I actually tried it: I encoded each label position as a one-hot vector, so the label vector has a size of 100 * 3 = 300.
Is the model I trained this way correct?
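
For reference, this is roughly how I built the 300-dimensional target vector (a minimal sketch; label_row stands for the 100 per-position class ids of one instance):

import torch

num_positions, num_classes = 100, 3

# One instance's raw labels: a class id (0, 1 or 2) for each of the 100 positions
label_row = torch.randint(0, num_classes, (num_positions,))

# One-hot encode every position, then flatten to a single 300-dim float vector
target = torch.nn.functional.one_hot(label_row, num_classes=num_classes)
target = target.flatten().to(torch.float)  # shape: (300,)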

Exactly, I faced the same issue: I have 100 labels per instance, which sums to a label vector of size 300. The model trained, but I am not sure about the results I got.

Hi @lewtun, can you please answer this?

I had the same issue with a large label set, so I implemented the conversion from raw labels to one-hot encoding inside the loss function, for each batch. I’ve trained my multi-label classification problem with more than 6000 labels using the custom loss below:

import torch
from torch.nn import BCEWithLogitsLoss
from transformers import Trainer


class MultiLabelTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.criterion = BCEWithLogitsLoss()

    def compute_loss(self, model, inputs, return_outputs=False):
        # Raw label ids for this batch, flattened to a 1-D tensor
        labels = inputs.pop("labels").view(-1)
        outputs = model(**inputs)
        logits = outputs["logits"]
        # Build the one-hot targets on the fly instead of materialising the full
        # one-hot matrix up front; uuidlabel_to_ids is my label -> id mapping
        one_hot_labels = torch.nn.functional.one_hot(
            labels, num_classes=len(uuidlabel_to_ids)
        ).to(dtype=torch.float)

        loss = self.criterion(logits, one_hot_labels)

        return (loss, outputs) if return_outputs else loss

But it might be helpful to play with pos_weight in BCEWithLogitsLoss, as this loss function tends to favour the majority classes.
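
A rough sketch of what that could look like (a minimal example with made-up numbers; label_counts would be the number of positive examples per label in your training set):

import torch
from torch.nn import BCEWithLogitsLoss

num_examples = 10000
# Hypothetical positives per label over the training set (frequent vs. rare labels)
label_counts = torch.tensor([5000.0, 500.0, 50.0])

# Up-weight rare labels: roughly negatives / positives for each label
pos_weight = (num_examples - label_counts) / label_counts.clamp(min=1.0)

# Drop-in replacement for the plain BCEWithLogitsLoss() in MultiLabelTrainer.__init__
criterion = BCEWithLogitsLoss(pos_weight=pos_weight)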
