How Labelled Data is Processed | Transformers Trainer

Hello, I am pretty new to fine-tuning and even NLP as a whole. I haven't been working with the Transformers library for very long, so there's much that I don't know or fully understand.

I've been trying to make sense of how the Transformers Trainer makes use of "labels", "input_ids", and "attention_masks" during the fine-tuning process.

Here is how I understood the overall process:

Supervised Fine-tuning:

  • to perform SFT with the Trainer class, you must explicitly provide "labels" as part of your train/eval dataset
  • when "labels" are not provided, the Trainer class still functions fine, but the process would then be classified as Unsupervised Fine-tuning
  • the Trainer has default metrics that it computes (perplexity…?) for evaluation, and you can explicitly provide other metrics to compute with the compute_metrics argument of the Trainer class
  • fine-tuning is performed by somehow using the labels and input_ids, with a certain loss function (cross-entropy loss…?) (a rough sketch of my mental model follows below)
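
To make that mental model concrete, here is a rough sketch of what I imagine happening when I hand the Trainer a labelled dataset. This is only my guess, not something I am sure happens internally: gpt2 is just a stand-in checkpoint and my_metric_fn is a hypothetical metrics function.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token    # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def make_example(text):
    # pad to a fixed length so the default collator can stack the batch
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=16)
    # labels are just the input_ids, with padding positions ignored via -100
    enc["labels"] = [tok if mask == 1 else -100
                     for tok, mask in zip(enc["input_ids"], enc["attention_mask"])]
    return enc

train_dataset = [make_example(t) for t in ["Hello world.", "Fine-tuning is fun."]]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=train_dataset,
    # compute_metrics=my_metric_fn,   # optional: extra metrics for evaluation
)
trainer.train()    # the cross-entropy loss comes from the model's own forward pass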

Here are some questions that I'm struggling with:

  1. In many of the fine-tuning tutorials I've seen, the authors use the Trainer class to train a model initialized with AutoModelForCausalLM. Does the "CausalLM" part indicate that the model has been initialized specifically for next token prediction…? If so, does the way that the model is initialized (…ForCausalLM, …ForSeq2Seq, etc.) have an effect on the way the Trainer performs fine-tuning? Similarly, do the different task types entail different formatting techniques when it comes to the data itself?

  2. Some of the demo code I've seen implements supervised fine-tuning by using the mask token id (-100) to mask everything except for what the model would ideally output (nox project). I wanted to know how exactly the mask token is processed by the language models during fine-tuning, but can't seem to find any decent sites/explanations.

I've been reading the HuggingFace documentation pages but can't seem to find enough information when it comes to the specifics. If there are any sources/guides about any of the topics, I would really appreciate them.


To answer your first question:

ForCausalLM is the task head that the model has. The Trainer is (roughly) model-agnostic: it only requires that our inputs and labels match the input and output dimensions of the model, so the same procedure can be used to train networks that have a different task head. If I were to chop off the CausalLM head and add on a TokenClassification head, then my model would need to be fine-tuned to ensure that the knowledge in the body of the model is able to work with the transplanted head.

Changing the head is simple; however, fine-tuning is then necessary. Your data should be framed depending on the task: data that is labelled at the paragraph level will not be useful for classification at the word level.
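
For example (a rough sketch, with gpt2 only as a stand-in checkpoint), you can load the same backbone behind different heads, and the transplanted head arrives with freshly initialised weights:

from transformers import (AutoModel, AutoModelForCausalLM,
                          AutoModelForTokenClassification)

backbone = AutoModel.from_pretrained("gpt2")            # body only, no task head
clm = AutoModelForCausalLM.from_pretrained("gpt2")      # body + language-modelling head
# same body, but the token-classification head is freshly (randomly) initialised,
# so this model needs fine-tuning before the new head is useful
tok_cls = AutoModelForTokenClassification.from_pretrained("gpt2", num_labels=5)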

Second Question:

The mask value (-100) is the default ignore_index of many PyTorch loss functions (for example CrossEntropyLoss). It basically asks that we ignore the current token when computing the loss, which means that the ignored token will not impact the model weights.
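
For example, you can see this directly in plain PyTorch (a minimal sketch):

import torch
import torch.nn.functional as F

logits = torch.randn(5, 100)                   # 5 token positions, vocab size 100
labels = torch.tensor([3, 7, -100, 2, -100])   # positions labelled -100 are skipped

# cross_entropy uses ignore_index=-100 by default, so the two masked positions
# contribute nothing to the loss (and therefore nothing to the gradients)
loss = F.cross_entropy(logits, labels)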

I have an article on the -100 padding ID and its place within the training:

Transformer Attention and Tokenisation | N E R | D S (medium.com)

Thank you for the detailed reply! I read up on the article you wrote, and it was incredibly insightful. I do, however, have a few additional questions stemming from your response. I would be grateful if you could spare some time to answer them:

  1. If I understood your reply correctly, it seems that the AutoModelForCausalLM class loads 1) the model and 2) the head associated with the CausalLM task. If that is the case, does the presence of the head have an effect on how the model is trained by the Trainer?

  2. Would you say that models initialized with the AutoModelForCausalLM class are suitable for training with labels? The CausalLM task (from my understanding) is designed to be an unsupervised task (next token prediction from previous tokens), and I can't really wrap my head around training a model initialized for such a task to use labelled data… And yet the model trains fine when I provide its trainer with the following form of train_dataset and eval_dataset:

The head does not impact the training algorithm in any way (though your data must match the head). You still just pass your data as tokens to the model and it does a forward pass. You then receive the result of that forward pass, which often contains the loss, logits, and maybe attentions. You can easily inspect the forward pass of any model on HuggingFace in the model class's source code.

No matter what head I put on my model, I am still just doing a forward pass to get outputs. This forward pass has been written for you. If you want to add a custom (previously unsupported) head, then you will need to override the forward pass to ensure those layers are used.
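
For instance, with something like this (a rough sketch, with gpt2 standing in for whatever model you use) you can see exactly what the forward pass hands back:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

enc = tokenizer("The cat sat on the mat", return_tensors="pt")
out = model(**enc, labels=enc["input_ids"], output_attentions=True)

print(out.keys())        # e.g. loss, logits, past_key_values, attentions
print(out.loss)          # loss computed inside the forward pass
print(out.logits.shape)  # (batch_size, sequence_length, vocab_size)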

If you just use AutoModel, you will receive a headless model. If you use any of the task-oriented classes such as ForCausalLM, it will ship with a head that typically needs to be fine-tuned, because the head has random weights while the body uses the pretrained weights.

I would not use CausalLM for classification because its goal is to predict the next token GIVEN previous tokens. With classification of text we want to consume all tokens and map them to some output (class). We may also want to classify at the token level, which means we want to consume all tokens and map each token to an output (class). In any case, classification is (IMO) more suited to encoder models such as BERT, RoBERTa, etc.

It is technically possible to do a rudimentary classification using a language model through few-shot prompting (perhaps with some RAG too). However, I think this task is more native to encoders than to decoders.
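
If you do go the encoder route, a sequence-level classifier looks roughly like this (just a sketch; bert-base-uncased is only an example checkpoint, and its classification head starts out with random weights):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)          # classification head starts untrained

enc = tokenizer("This film was great", return_tensors="pt")
out = model(**enc, labels=torch.tensor([1]))    # one class label for the whole sequence
print(out.loss, out.logits)                     # logits have shape (batch_size, num_labels)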

I see, thank you for that insight! Would you say that the CausalLM head is suitable for instruction tuning? Like classification, instruction tuning does make use of labels, but I get the feeling it's not quite the same. I've reformatted my dataset into the form

List[Dict[str, List[int]]]

where each Dict consists of 'input_ids', 'attention_masks', and 'labels' like so:

{
     'input_ids': List[int],
     'labels' : List[int],
     'attention_masks' : List[int]
}

The data I'm working with is the databricks dataset and it's been formatted with the following templates:

input_template = '''### Instruction
{instruction}

### Context
{context}

### Your response'''

target_template = '''{response}'''

The labels have been produced by masking all of the tokens in the input_template and leaving only the response unmasked. I feed the above list of dictionaries straight into the transformers Trainer, along with a model (Gemma) initialized with AutoModelForCausalLM. If this isn't an appropriate way of processing this type of dataset, what would you recommend (different task head, no task head)?
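
Concretely, each example is built roughly like this (a simplified sketch of my preprocessing; the checkpoint name is just whichever Gemma variant I load, and the placeholder strings stand in for one record of the dataset):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")   # whichever Gemma checkpoint I use

# one record of the dataset (placeholder values, just for illustration)
instruction, context, response = "Summarize the text.", "Some context here.", "A short summary."

prompt = input_template.format(instruction=instruction, context=context)
target = target_template.format(response=response) + tokenizer.eos_token

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + target_ids
labels = [-100] * len(prompt_ids) + target_ids   # loss is computed only on the response tokens
attention_mask = [1] * len(input_ids)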

I'm actually not sure about this one. I haven't instruction-tuned a model yet, though it is in my not-so-far-away future. I would be happy if you could share any insights you learn with me, as I will probably be walking this road soon :smile:

Iā€™ll let you know if I come across any insights :slight_smile:
Thank you for all the help.

I've gained a little bit of insight into the problem, so I'm sharing this here for reference.

To summarize what the problem was: what confused me was the fact that the "Causal Language Modeling" task that I had initialized my model for is one that is inherently unsupervised (that is, it does not make use of labels, as the inputs themselves are the labels), but I had structured my data to include said "labels" and wasn't sure how the Trainer made use of them when training.

Since my code compiled and ran fine, I had assumed that the trainer was doing what I had wanted and somehow making use of the labelled data, but it turns out that this was actually not the case.

In actuality the trainer was NOT making use of the 'labels' I had provided, and I think the problem lies with the fact that I had been using the DataCollatorForLanguageModeling for my trainer. Apparently data collators are not just designed to create batches from your data. Depending on the class, the data collator may also perform some additional processing of the data, which was the case for the DataCollatorForLanguageModeling class that I was utilizing.

I took a look inside the code for this data collator class and noticed that this particular data collator seemed to be ignoring the presence of labels. That is, the data collator essentially assumes (from my understanding) that the data provided to the collator does not contain labels and generates a new set of labels for the dataset. More concretely, one of the arguments that can be used when initializing the class is the "mlm" argument, and when it is set to "False", the data collator creates "labels" by cloning the "input_ids" of the dataset.

Source: DataCollator File in Transformers GitHub
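
To see this in action, a small experiment along these lines (a sketch; the exact padding behaviour may differ a little between versions) shows my hand-made labels being replaced:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# an example that already carries hand-made labels (only the last token unmasked)
features = [{"input_ids": [10, 11, 12], "labels": [-100, -100, 12]}]
batch = collator(features)

print(batch["labels"])   # a clone of input_ids (with pad positions set to -100),
                         # not the hand-made labels above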

This was what I had been doing, which means that even though I had structured my dataset to include labels, by the time they had passed through the data collator and made their way into the training process of the Trainer, the original labels I had manually created had been lost and replaced with a copy of the "input_ids". Considering the nature of the CLM task, that seems reasonable to me.

(As an aside, when the "mlm" argument is set to True, the data collator performs masking with the MLM approach.)

I think what I need to do is either figure out which of the Data Collators suits my needs or create an entirely new custom DataCollator class. There is a Medium article about this topic that I think I'll be referencing, so I'm sharing that here as well. The article also provides example code for how one might create a DataCollator class for an Instruction Tuning dataset, which I found particularly helpful. It also provides a better explanation of data collators than I can, so I recommend reading it if you can (requires a Medium subscription unfortunately).
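
For reference, the rough idea of such a custom collator (my own sketch, not the article's code) is to pad input_ids and attention_mask normally while padding the labels with -100:

import torch

class InstructionTuningCollator:
    """Pads input_ids/attention_mask with the tokenizer's pad token,
    but pads labels with -100 so padding never contributes to the loss."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        pad_id = self.tokenizer.pad_token_id
        batch = {"input_ids": [], "attention_mask": [], "labels": []}
        for f in features:
            pad = max_len - len(f["input_ids"])
            batch["input_ids"].append(f["input_ids"] + [pad_id] * pad)
            batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad)
            batch["labels"].append(f["labels"] + [-100] * pad)
        return {k: torch.tensor(v) for k, v in batch.items()}

It would then be passed to the Trainer through its data_collator argument.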

So the conclusion is that the default CLM data collator is NOT suitable for pre-made labels, so I need to configure my trainer and data collator in a way that better suits my dataset.

I still don't really know how "which model head the model is initialized with" affects the training process (for example, does the task head the model is initialized with influence how the loss is calculated?), so I think I'll be looking into that next.


Great information, thank you for sharing.

The model head is just an output layer that matches a task. Loss is either handled by you directly or within the forward pass of the model.
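
For example, for a causal LM these two routes should give essentially the same number; the second is roughly what happens inside the forward pass (gpt2 is just a stand-in here):

import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
enc = tokenizer("The quick brown fox", return_tensors="pt")

# 1) loss computed for you inside the forward pass
out = model(**enc, labels=enc["input_ids"])

# 2) the same loss computed by hand from the logits
#    (shifted so each position predicts the next token)
logits = out.logits[:, :-1, :]
targets = enc["input_ids"][:, 1:]
manual_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

print(out.loss, manual_loss)   # should match up to numerical details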

I see, thank you!
