Fine-Tune for MultiClass or MultiLabel-MultiClass

Hi @lewtun,

I have 200 classes. Which model should I choose to get good accuracy, and how should I handle an imbalanced dataset?

Any ideas here? I'm struggling with that at the moment.

Hi @lewtun, I’m getting an error when trying to load the dataset:

ValueError: Couldn't cast
id: string
comment_text: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 490
to
{'id': Value(dtype='string', id=None), 'comment_text': Value(dtype='string', id=None), 'toxic': Value(dtype='int64', id=None), 'severe_toxic': Value(dtype='int64', id=None), 'obscene': Value(dtype='int64', id=None), 'threat': Value(dtype='int64', id=None), 'insult': Value(dtype='int64', id=None), 'identity_hate': Value(dtype='int64', id=None)}
because column names don't match

Hey @jacobajit, are you referring to the go_emotions dataset? I’m able to load it using datasets v1.18.2 with

from datasets import load_dataset

dataset = load_dataset('go_emotions', 'simplified')

Perhaps you can upgrade your datasets version and try again?

@lewtun No, this is in reference to the toxic comments dataset that’s to be loaded in this cell of the multi-label example notebook above (transformers_multilabel-text-classification-with-problem-type.ipynb - Google Drive):

from pathlib import Path
from datasets import load_dataset

# path to train.csv, test.csv and test_labels.csv
data_dir = Path("/content/gdrive/MyDrive/Colab Notebooks/data")
ds = (load_dataset("jigsaw_toxicity_pred", data_dir=data_dir, split='train')
        .train_test_split(train_size=800, test_size=200))
ds

I’ve placed train.csv, test.csv, and test_labels.csv in the data_dir, but I’m still getting this error.

I’ve just downloaded the dataset from Kaggle, unzipped the files and created a directory with the following structure:

jigsaw-toxic-comment-classification-challenge
├── test.csv
├── test_labels.csv
└── train.csv

I was then able to load it using your command without error. It seems like there might be a problem with either the raw data you’re working with or (possibly) the pandas version that’s used in the underlying dataset loading script. I’m using pandas v1.2.5 so maybe try upgrading and see if that works?
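
For what it's worth, you can double-check the versions in your environment with something like this (a minimal sketch; the commented pip line assumes you're running in a Colab/Jupyter cell):

import pandas as pd
import datasets

# Versions in the current environment
print(pd.__version__)        # loading worked for me with pandas 1.2.5
print(datasets.__version__)  # and datasets 1.18.2

# To upgrade inside Colab, run this in a cell and then restart the runtime:
# !pip install -U pandas datasets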

I’m working on multi-label classification using Hugging Face Transformers. I added the problem_type argument in the model initialization exactly as you did, but it gave me this error:

TypeError: __init__() got an unexpected keyword argument 'problem_type'

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        return_dict=True,
        num_labels=7,
        problem_type="multi_label_classification",
    )

Could you please help me solve this issue? My complete code, which includes preparing the dataset for training, is here.

What modifications do I need to make?

Thank you for your help in advance.

Hi,

Check out this notebook for fine-tuning any model for multi-label classification: Transformers-Tutorials/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub.
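
The core recipe there, roughly, is to set problem_type="multi_label_classification" and feed the model multi-hot float label vectors, which makes it use BCEWithLogitsLoss internally (a minimal sketch, assuming a recent transformers version; bert-base-uncased and the label names are just placeholders):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

label_names = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_names),
    problem_type="multi_label_classification",
)

# Labels are multi-hot *float* vectors, one slot per label
labels = torch.tensor([[1.0, 0.0, 1.0, 0.0, 0.0, 0.0]])

encoding = tokenizer("an example comment", return_tensors="pt")
outputs = model(**encoding, labels=labels)
print(outputs.loss)  # BCEWithLogitsLoss is applied under the hood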


Most of the examples shared deal with a very small label set, where generating one-hot encoded vectors is not an issue. But how can we train a model with hundreds or thousands of labels? Converting the labels to one-hot encoded vectors is not an option.

Hi @nielsr @sgugger, I am stuck on a somewhat different kind of problem. I have read the thread above, which was a great read, and I also referred to the notebook you shared. Please see my problem below.

I have three classes (0, 1, 2) and 100 labels per instance.
Label format: <0, 2, 0, 1, 0, 2, …, 100th label>

So how do I model this? Would exactly the same notebook you shared work here? I actually tried it: I encoded each label position as a one-hot vector, so the label vector has a size of 100 * 3 = 300.
Is the model I trained this way correct?
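
For reference, this is roughly how I built the 300-dimensional target vector (a minimal sketch; label_row stands for the 100 per-position class ids of one instance):

import torch

num_positions, num_classes = 100, 3

# One instance's raw labels: a class id (0, 1 or 2) for each of the 100 positions
label_row = torch.randint(0, num_classes, (num_positions,))

# One-hot encode every position, then flatten to a single 300-dim float vector
target = torch.nn.functional.one_hot(label_row, num_classes=num_classes)
target = target.flatten().to(torch.float)  # shape: (300,)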

Exactly, I faced the same issue: I have 100 labels per instance, which sums to a label vector of size 300. The model trained, but I am not sure about the results I got.

Hi @lewtun, can you please answer this?

I had the same issue with a large label set, so I implemented the conversion from raw labels to one-hot encoding inside the loss function, for each batch. I’ve trained my multi-label classification problem with more than 6000 labels using the custom loss below:

import torch
from torch.nn import BCEWithLogitsLoss
from transformers import Trainer


class MultiLabelTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.criterion = BCEWithLogitsLoss()

    def compute_loss(self, model, inputs, return_outputs=False):
        # Raw label ids for this batch, flattened to a 1-D tensor
        labels = inputs.pop("labels").view(-1)
        outputs = model(**inputs)
        logits = outputs["logits"]
        # Build the one-hot targets on the fly instead of materialising the full
        # one-hot matrix up front; uuidlabel_to_ids is my label -> id mapping
        one_hot_labels = torch.nn.functional.one_hot(
            labels, num_classes=len(uuidlabel_to_ids)
        ).to(dtype=torch.float)

        loss = self.criterion(logits, one_hot_labels)

        return (loss, outputs) if return_outputs else loss

But it might be helpful to play with pos_weight in BCEWithLogitsLoss, as this loss function tends to favour the majority classes.
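
A rough sketch of what that could look like (a minimal example with made-up numbers; label_counts would be the number of positive examples per label in your training set):

import torch
from torch.nn import BCEWithLogitsLoss

num_examples = 10000
# Hypothetical positives per label over the training set (frequent vs. rare labels)
label_counts = torch.tensor([5000.0, 500.0, 50.0])

# Up-weight rare labels: roughly negatives / positives for each label
pos_weight = (num_examples - label_counts) / label_counts.clamp(min=1.0)

# Drop-in replacement for the plain BCEWithLogitsLoss() in MultiLabelTrainer.__init__
criterion = BCEWithLogitsLoss(pos_weight=pos_weight)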
