How Labelled Data is Processed | Transformers Trainer

Hello, I am pretty new to fine-tuning and even NLP as a whole. I haven't been working with the Transformers library for very long, so there's much that I don't know or fully understand.

I've been trying to make sense of how the Transformers Trainer makes use of "labels", "input_ids", and "attention_masks" during the fine-tuning process.

Here is how I understood the overall process:

Supervised Fine-tuning:

  • to perform SFT with the Trainer class, you must explicitly provide "labels" as part of your train/eval dataset
  • when "labels" are not provided, the Trainer class still functions fine, but the process would then be classified as Unsupervised Fine-tuning
  • the Trainer has default metrics that it computes (perplexity…?) for evaluation, and you can explicitly provide other metrics to compute with the compute_metrics argument of the Trainer class
  • fine-tuning is performed by somehow using the labels and input_ids, with a certain loss function (cross-entropy loss…?) (a rough sketch of my mental model follows below)
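
To make that mental model concrete, here is a rough sketch of what I imagine happening when I hand the Trainer a labelled dataset. This is only my guess, not something I am sure happens internally: gpt2 is just a stand-in checkpoint and my_metric_fn is a hypothetical metrics function.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token    # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def make_example(text):
    # pad to a fixed length so the default collator can stack the batch
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=16)
    # labels are just the input_ids, with padding positions ignored via -100
    enc["labels"] = [tok if mask == 1 else -100
                     for tok, mask in zip(enc["input_ids"], enc["attention_mask"])]
    return enc

train_dataset = [make_example(t) for t in ["Hello world.", "Fine-tuning is fun."]]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=train_dataset,
    # compute_metrics=my_metric_fn,   # optional: extra metrics for evaluation
)
trainer.train()    # the cross-entropy loss comes from the model's own forward pass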

Here are some questions that I'm struggling with:

  1. In many of the fine-tuning tutorials I've seen, the authors use the Trainer class to train a model initialized with AutoModelForCausalLM. Does the "CausalLM" part indicate that the model has been initialized specifically for next token prediction…? If so, does the way that the model is initialized (…ForCausalLM, …ForSeq2Seq, etc.) have an effect on the way the Trainer performs fine-tuning? Similarly, do the different task types entail different formatting techniques when it comes to the data itself?

  2. Some of the demo code I've seen implements supervised fine-tuning by using the mask token id (-100) to mask everything except for what the model would ideally output (nox project). I wanted to know how exactly the mask token is processed by the language models during fine-tuning, but can't seem to find any decent sites/explanations.

I've been reading the HuggingFace documentation pages but can't seem to find enough information when it comes to the specifics. If there are any sources/guides about any of the topics, I would really appreciate them.


To answer your first question:

ForCausalLM is the task head that the model has. The Trainer is (roughly) model-agnostic: it only requires that our inputs and labels match the input and output dimensions of the model, so the same procedure can be used to train networks that have a different task head. If I were to chop off the CausalLM head and add on a TokenClassification head, then my model would need to be fine-tuned to ensure that the knowledge in the body of the model is able to work with the transplanted head.

Changing the head is simple; however, fine-tuning is then necessary. Your data should be framed depending on the task: data that is labelled at the paragraph level will not be useful for classification at the word level.
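
For example (a rough sketch, with gpt2 only as a stand-in checkpoint), you can load the same backbone behind different heads, and the transplanted head arrives with freshly initialised weights:

from transformers import (AutoModel, AutoModelForCausalLM,
                          AutoModelForTokenClassification)

backbone = AutoModel.from_pretrained("gpt2")            # body only, no task head
clm = AutoModelForCausalLM.from_pretrained("gpt2")      # body + language-modelling head
# same body, but the token-classification head is freshly (randomly) initialised,
# so this model needs fine-tuning before the new head is useful
tok_cls = AutoModelForTokenClassification.from_pretrained("gpt2", num_labels=5)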

Second Question:

The mask value (-100) is the default ignore_index of many PyTorch loss functions (for example CrossEntropyLoss). It basically asks that we ignore the current token when computing the loss, which means that the ignored token will not impact the model weights.
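
For example, you can see this directly in plain PyTorch (a minimal sketch):

import torch
import torch.nn.functional as F

logits = torch.randn(5, 100)                   # 5 token positions, vocab size 100
labels = torch.tensor([3, 7, -100, 2, -100])   # positions labelled -100 are skipped

# cross_entropy uses ignore_index=-100 by default, so the two masked positions
# contribute nothing to the loss (and therefore nothing to the gradients)
loss = F.cross_entropy(logits, labels)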

I have an article on the -100 padding ID and its place within the training:

Transformer Attention and Tokenisation | N E R | D S (medium.com)

Thank you for the detailed reply! I read up on the article you wrote, and it was incredibly insightful. I do, however, have a few additional questions stemming from your response. I would be grateful if you could spare some time to answer them:

  1. If I understood your reply correctly, it seems that the AutoModelForCausalLM class loads 1) the model and 2) the head associated with the CausalLM task. If that is the case, does the presence of the head have an effect on how the model is trained by the Trainer?

  2. Would you say that models initialized with the AutoModelForCausalLM class are suitable for training with labels? The CausalLM task (from my understanding) is designed to be an unsupervised task (next token prediction from previous tokens), and I can't really wrap my head around training a model initialized for such a task to use labelled data… And yet the model trains fine when I provide its trainer with the following form of train_dataset and eval_dataset:

The head does not impact the training algorithm in any way (though your data must match the head). You still just pass your data as tokens to the model and it does a forward pass. You then receive the result of that forward pass, which often contains the loss, logits, and maybe attentions. You can easily inspect the forward pass of any model on HuggingFace in the model class's source code.

No matter what head I put on my model, I am still just doing a forward pass to get outputs. This forward pass has been written for you. If you want to add a custom (previously unsupported) head, then you will need to override the forward pass to ensure those layers are used.
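
For instance, with something like this (a rough sketch, with gpt2 standing in for whatever model you use) you can see exactly what the forward pass hands back:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

enc = tokenizer("The cat sat on the mat", return_tensors="pt")
out = model(**enc, labels=enc["input_ids"], output_attentions=True)

print(out.keys())        # e.g. loss, logits, past_key_values, attentions
print(out.loss)          # loss computed inside the forward pass
print(out.logits.shape)  # (batch_size, sequence_length, vocab_size)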

If you just use AutoModel, you will receive a headless model. If you use any of the task-oriented classes such as ForCausalLM, it will ship with a head that typically needs to be fine-tuned, because the head has random weights while the body uses the pretrained weights.

I would not use CausalLM for classification because its goal is to predict the next token GIVEN previous tokens. With classification of text we want to consume all tokens and map them to some output (class). We may also want to classify at the token level, which means we want to consume all tokens and map each token to an output (class). In any case, classification is (IMO) more suited to encoder models such as BERT, RoBERTa, etc.

It is technically possible to do a rudimentary classification using a language model through few-shot prompting (perhaps with some RAG too). However, I think this task is more native to encoders than to decoders.
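
If you do go the encoder route, a sequence-level classifier looks roughly like this (just a sketch; bert-base-uncased is only an example checkpoint, and its classification head starts out with random weights):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)          # classification head starts untrained

enc = tokenizer("This film was great", return_tensors="pt")
out = model(**enc, labels=torch.tensor([1]))    # one class label for the whole sequence
print(out.loss, out.logits)                     # logits have shape (batch_size, num_labels)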

I see, thank you for that insight! Would you say that the CausalLM head is suitable for instruction tuning? Like classification, instruction tuning does make use of labels, but I get the feeling it's not quite the same. I've reformatted my dataset into the form

List[Dict[str, List[int]]]

where each Dict consists of 'input_ids', 'attention_masks', and 'labels' like so:

{
     'input_ids': List[int],
     'labels' : List[int],
     'attention_masks' : List[int]
}

The data I'm working with is the databricks dataset and it's been formatted with the following templates:

input_template = '''### Instruction
{instruction}

### Context
{context}

### Your response'''

target_template = '''{response}'''

The labels have been produced by masking all of the tokens in the input_template and leaving only the response unmasked. I feed the above list of dictionaries straight into the transformers Trainer, along with a model (Gemma) initialized with AutoModelForCausalLM. If this isn't an appropriate way of processing this type of dataset, what would you recommend (different task head, no task head)?
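
Concretely, each example is built roughly like this (a simplified sketch of my preprocessing; the checkpoint name is just whichever Gemma variant I load, and the placeholder strings stand in for one record of the dataset):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")   # whichever Gemma checkpoint I use

# one record of the dataset (placeholder values, just for illustration)
instruction, context, response = "Summarize the text.", "Some context here.", "A short summary."

prompt = input_template.format(instruction=instruction, context=context)
target = target_template.format(response=response) + tokenizer.eos_token

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + target_ids
labels = [-100] * len(prompt_ids) + target_ids   # loss is computed only on the response tokens
attention_mask = [1] * len(input_ids)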

I'm actually not sure about this one. I haven't instruction-tuned a model yet, though it is in my not-so-far-away future. I would be happy if you could share any insights you learn with me, as I will probably be walking this road soon :smile:

Iā€™ll let you know if I come across any insights :slight_smile:
Thank you for all the help.

I've gained a little bit of insight into the problem, so I'm sharing this here for reference.

To summarize what the problem was: what confused me was the fact that the "Causal Language Modeling" task that I had initialized my model for is one that is inherently unsupervised (that is, it does not make use of labels, as the inputs themselves are the labels), but I had structured my data to include said "labels" and wasn't sure how the Trainer made use of them when training.

Since my code compiled and ran fine, I had assumed that the trainer was doing what I had wanted and somehow making use of the labelled data, but it turns out that this was actually not the case.

In actuality the trainer was NOT making use of the 'labels' I had provided, and I think the problem lies with the fact that I had been using the DataCollatorForLanguageModeling for my trainer. Apparently data collators are not just designed to create batches from your data. Depending on the class, the data collator may also perform some additional processing of the data, which was the case for the DataCollatorForLanguageModeling class that I was utilizing.

I took a look inside the code for this data collator class and noticed that this particular data collator seemed to be ignoring the presence of labels. That is, the data collator essentially assumes (from my understanding) that the data provided to the collator does not contain labels and generates a new set of labels for the dataset. More concretely, one of the arguments that can be used when initializing the class is the "mlm" argument, and when it is set to "False", the data collator creates "labels" by cloning the "input_ids" of the dataset.

Source: DataCollator File in Transformers GitHub
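
To see this in action, a small experiment along these lines (a sketch; the exact padding behaviour may differ a little between versions) shows my hand-made labels being replaced:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# an example that already carries hand-made labels (only the last token unmasked)
features = [{"input_ids": [10, 11, 12], "labels": [-100, -100, 12]}]
batch = collator(features)

print(batch["labels"])   # a clone of input_ids (with pad positions set to -100),
                         # not the hand-made labels above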

This was what I had been doing, which means that even though I had structured my dataset to include labels, by the time they had passed through the data collator and made their way into the training process of the Trainer, the original labels I had manually created had been lost and replaced with a copy of the "input_ids". Considering the nature of the CLM task, that seems reasonable to me.

(As an aside, when the "mlm" argument is set to True, the data collator performs masking with the MLM approach.)

I think what I need to do is either figure out which of the Data Collators suits my needs or create an entirely new custom DataCollator class. There is a Medium article about this topic that I think I'll be referencing, so I'm sharing that here as well. The article also provides example code for how one might create a DataCollator class for an Instruction Tuning dataset, which I found particularly helpful. It also provides a better explanation of data collators than I can, so I recommend reading it if you can (requires a Medium subscription unfortunately).
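
For reference, the rough idea of such a custom collator (my own sketch, not the article's code) is to pad input_ids and attention_mask normally while padding the labels with -100:

import torch

class InstructionTuningCollator:
    """Pads input_ids/attention_mask with the tokenizer's pad token,
    but pads labels with -100 so padding never contributes to the loss."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        pad_id = self.tokenizer.pad_token_id
        batch = {"input_ids": [], "attention_mask": [], "labels": []}
        for f in features:
            pad = max_len - len(f["input_ids"])
            batch["input_ids"].append(f["input_ids"] + [pad_id] * pad)
            batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad)
            batch["labels"].append(f["labels"] + [-100] * pad)
        return {k: torch.tensor(v) for k, v in batch.items()}

It would then be passed to the Trainer through its data_collator argument.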

So the conclusion is that the default CLM data collator is NOT suitable for pre-made labels, so I need to configure my trainer and data collator in a way that better suits my dataset.

I still don't really know how "which model head the model is initialized with" affects the training process (for example, does the task head the model is initialized with influence how the loss is calculated?), so I think I'll be looking into that next.


Great information, thank you for sharing.

The model head is just an output layer that matches a task. Loss is either handled by you directly or within the forward pass of the model.
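
For example, for a causal LM these two routes should give essentially the same number; the second is roughly what happens inside the forward pass (gpt2 is just a stand-in here):

import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
enc = tokenizer("The quick brown fox", return_tensors="pt")

# 1) loss computed for you inside the forward pass
out = model(**enc, labels=enc["input_ids"])

# 2) the same loss computed by hand from the logits
#    (shifted so each position predicts the next token)
logits = out.logits[:, :-1, :]
targets = enc["input_ids"][:, 1:]
manual_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

print(out.loss, manual_loss)   # should match up to numerical details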

I see, thank you!
