Fine-tune RoBERTa on MLM with a custom loss (additional component)

I want to fine-tune RoBERTa on the MLM task on my own data.
However, for each word I also have an additional vector with 10 elements.
So whenever I predict a masked token, I want the loss to be:
a * MLM_loss + b * vector_prediction_loss.

How can I do it? I didn't find any example or tutorial.

Any ideas?

You can simply override the Trainer’s loss computation: Specify Loss for Trainer / TrainingArguments - #2 by nielsr
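For this thread's setup, here is a minimal sketch of such an override, assuming a small wrapper model with an extra vector head. The names (RobertaMLMWithVectorHead, vector_head, labels2) and the weights a and b are illustrative assumptions, not part of the linked answer:

import torch.nn as nn
from transformers import RobertaForMaskedLM, Trainer


class RobertaMLMWithVectorHead(nn.Module):
    """RobertaForMaskedLM plus a linear head that predicts a
    10-element vector per token (names here are illustrative)."""

    def __init__(self, model_name="roberta-base", vector_dim=10):
        super().__init__()
        self.mlm = RobertaForMaskedLM.from_pretrained(model_name)
        self.vector_head = nn.Linear(self.mlm.config.hidden_size, vector_dim)

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        out = self.mlm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
            output_hidden_states=True,
        )
        # Per-token vector predictions from the last hidden state.
        vector_preds = self.vector_head(out.hidden_states[-1])
        return out, vector_preds


class CombinedLossTrainer(Trainer):
    a, b = 1.0, 0.1  # weights of the two loss terms; tune for your task

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # "labels2" holds the per-token target vectors; pop it so the
        # model's forward does not receive an unexpected keyword.
        labels2 = inputs.pop("labels2")
        out, vector_preds = model(**inputs)
        # Score the vector head only on masked positions, mirroring
        # where the MLM loss is computed (-100 marks unmasked tokens).
        mask = inputs["labels"] != -100
        vector_loss = nn.functional.mse_loss(vector_preds[mask], labels2[mask])
        loss = self.a * out.loss + self.b * vector_loss
        return (loss, out) if return_outputs else loss

Note that labels2 still has to survive the Trainer’s column filtering and the data collator to reach compute_loss, which is exactly the issue discussed below.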

@nielsr Let’s say my labels have two parts: label1 and label2.
What is the best way to pass label1 and label2 to compute_loss?
I added another column to the dataset, so my columns are now:

    features: ['input_ids', 'attention_mask', 'labels', 'labels2'],
    num_rows: 563

But after I initialize the trainer, if I put a breakpoint at the first line of compute_loss(), I see that there is no ‘labels2’ key in inputs.

Hi all,

I know this is a very late answer, but the problem is in the data collator: it probably drops labels2.

In my case, I wanted to calculate the per-example loss in the compute_loss function and save it for later use. To do that, I needed to pass a “qid” feature through to compute_loss.

Adding the following lines to the default collator fixed my issue:

        elif k == "qid":
            batch[k] = [f[k] for f in features]

The full collator function:

from collections.abc import Mapping

import numpy as np
import torch


def torch_default_data_collator(features):
    if not isinstance(features[0], Mapping):
        features = [vars(f) for f in features]
    first = features[0]
    batch = {}

    # Special handling for labels.
    # Ensure that tensor is created with the correct type
    # (it should be automatically the case, but let's make sure of it.)
    if "label" in first and first["label"] is not None:
        label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
        dtype = torch.long if isinstance(label, int) else torch.float
        batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
    elif "label_ids" in first and first["label_ids"] is not None:
        if isinstance(first["label_ids"], torch.Tensor):
            batch["labels"] = torch.stack([f["label_ids"] for f in features])
        else:
            dtype = torch.long if isinstance(first["label_ids"][0], int) else torch.float
            batch["labels"] = torch.tensor([f["label_ids"] for f in features], dtype=dtype)

    # Handling of all other possible keys.
    # Again, we will use the first element to figure out which key/values are not None for this model.
    for k, v in first.items():
        if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
            if isinstance(v, torch.Tensor):
                batch[k] = torch.stack([f[k] for f in features])
            elif isinstance(v, np.ndarray):
                batch[k] = torch.tensor(np.stack([f[k] for f in features]))
            else:
                batch[k] = torch.tensor([f[k] for f in features])
        elif k == "qid":
            # Keep string-valued features such as "qid" as a plain
            # Python list; strings cannot be converted to tensors.
            batch[k] = [f[k] for f in features]

    return batch
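To wire this in, pass the patched collator to the Trainer. A minimal sketch, assuming model and train_dataset are already defined; depending on your transformers version you may also need remove_unused_columns=False, since the Trainer strips dataset columns that the model's forward signature does not accept before they ever reach the collator:

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    # Keep extra columns like "labels2" or "qid" alive; by default the
    # Trainer drops columns the model's forward does not accept.
    remove_unused_columns=False,
)

trainer = Trainer(
    model=model,                   # your model
    args=args,
    train_dataset=train_dataset,   # must contain the extra column(s)
    data_collator=torch_default_data_collator,
)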