Hi! I am working on a multilabel token classification problem. I have not managed to localize the issue precisely, but I suspect it is somewhere in data collation and/or loading.
The dataset has the following format:
Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})
where (multi-hot encoded) labels are of shape (sequence_len, num_classes).
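For reference, a single example looks roughly like this (a minimal sketch with made-up values, assuming num_classes = 3):

# One dataset example (illustrative values only, assuming num_classes = 3)
example = {
    "input_ids":      [101, 7592, 2088, 102],
    "token_type_ids": [0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1],
    # one multi-hot vector per token: shape (sequence_len, num_classes)
    "labels": [
        [0, 0, 0],  # [CLS]
        [1, 0, 1],  # token assigned to classes 0 and 2
        [0, 1, 0],  # token assigned to class 1
        [0, 0, 0],  # [SEP]
    ],
}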
Using the default DataCollatorForTokenClassification throws an error in torch_call(), because that class expects a 1D list of label_ids per example (as in standard single-label token classification). This is why I implemented a custom data collator:
import numpy as np
import torch
from transformers import DataCollatorForTokenClassification

class DataCollatorForMultilabelTokenClassification(DataCollatorForTokenClassification):
    def torch_call(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature[label_name] for feature in features] if label_name in features[0].keys() else None
        no_labels_features = [{k: v for k, v in feature.items() if k != label_name} for feature in features]

        # Pad the tokenization fields as usual
        batch = self.tokenizer.pad(
            no_labels_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        if labels is None:
            return batch

        sequence_length = batch["input_ids"].shape[1]
        padding_side = self.tokenizer.padding_side

        # Pad each (seq_len, num_classes) label matrix along the sequence dimension
        # with rows of label_pad_token_id, so the labels stack to
        # (batch_size, sequence_length, num_classes)
        if padding_side == "right":
            batch[label_name] = []
            for label in labels:
                padding = np.full((sequence_length - len(label), len(label[0])), self.label_pad_token_id, dtype=int)
                padded = np.concatenate((np.array(label), padding)).tolist()
                batch[label_name].append(padded)
        else:
            batch[label_name] = []
            for label in labels:
                padding = np.full((sequence_length - len(label), len(label[0])), self.label_pad_token_id, dtype=int)
                padded = np.concatenate((padding, np.array(label))).tolist()
                batch[label_name].append(padded)

        batch[label_name] = torch.tensor(batch[label_name], dtype=torch.int64)
        return batch
data_collator = DataCollatorForMultilabelTokenClassification(tokenizer=tokenizer, label_pad_token_id=-100)
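Calling the collator directly on a few examples does produce the expected shapes (a quick sanity check, assuming the dataset and tokenizer from above):

# Sanity check on a handful of raw examples from the dataset
features = [dataset[i] for i in range(4)]
batch = data_collator(features)

print(batch["input_ids"].shape)  # torch.Size([4, sequence_len])
print(batch["labels"].shape)     # torch.Size([4, sequence_len, num_classes])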
But during training the dataloader (which is the default one, I did not configure it at all) loads a batch with wrong dimensions: input_ids (and the other tokenization fields) come out as (batch_size, sequence_len), and labels are expected to be (batch_size, sequence_len, num_classes), which is indeed their shape right after the collator's torch_call(); but they reach the loss function with shape (1, sequence_len, num_classes), i.e. only one sample of the batch…
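To illustrate where the wrong shape shows up, here is a rough sketch of the loss side (not my exact code; the MultilabelTrainer name and BCEWithLogitsLoss are just an outline of a typical multilabel setup, and masking of the -100 padding rows is omitted):

import torch.nn as nn
from transformers import Trainer

class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Shapes observed at this point:
        #   inputs["input_ids"]: (batch_size, sequence_len)      -- as expected
        #   inputs["labels"]:    (1, sequence_len, num_classes)  -- only one sample of the batch
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits  # (batch_size, sequence_len, num_classes)
        loss = nn.BCEWithLogitsLoss()(logits, labels.float())
        return (loss, outputs) if return_outputs else loss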
Am I missing something? I would appreciate any pointers :)