How should data be structured to fine-tune a CausalLM?

Hi all,

Here’s a scenario: I want to fine-tune GPT-2 on the English subset of wikimedia/wikipedia. This dataset is, famously, a collection of blocks of text that look a bit like figure 1 below.

I’m going to be fine-tuning using TensorFlow (just because I need to use it for a downstream experiment afterward). Because GPT-2 is autoregressive, I’m assuming the data should look a bit like figure 2 (only tokenized, padded, batched, shuffled, etc.).

Is this assumption correct? There’s no documentation anywhere on this question, and my reasoning is that the output of a causal LM at each position is its predicted logits for the *next* token, so the labels should just be the input shifted one position to the left. This data structure plays nicely with sparse categorical loss functions and, importantly, is uniform/rectangular in shape.
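To make that concrete, here’s a minimal plain-Python sketch of the structure I have in mind (the helper name, the pad id of 0, and the -100 ignore value are my own assumptions, the last mirroring the common Hugging Face convention for masked-out label positions):

```python
def make_example(token_ids, max_len, pad_id=0, ignore_id=-100):
    """Turn one tokenized text into a fixed-length (input_ids, labels) pair.

    labels[i] is the token the model should predict after seeing
    input_ids[0..i], i.e. the inputs shifted one position to the left.
    """
    inputs = token_ids[:max_len]
    labels = token_ids[1:max_len + 1]  # next-token targets
    # Pad both sides so every example is rectangular.
    inputs = inputs + [pad_id] * (max_len - len(inputs))
    labels = labels + [ignore_id] * (max_len - len(labels))
    return inputs, labels

ids = [5, 9, 2, 7]  # pretend tokenizer output
x, y = make_example(ids, max_len=4)
# x = [5, 9, 2, 7]
# y = [9, 2, 7, -100]  # the final position has no next token to predict
```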

If this assumption is wrong, I can’t visualise what else could be the case. Examples of fine-tuning instruct models seem to have no labels at all, instead just fitting the model on a set of input_ids (see: https://ai.google.dev/gemma/docs/lora_tuning#load_dataset), but when I tested this with my implementation the error was something along the lines of “there was no label column detected in the dataset” - which makes sense, given all that was going in was input_ids and an attention mask.
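For what it’s worth, my understanding of why those examples get away without an explicit label column is that Hugging Face causal LM heads perform the one-position shift internally when you pass `labels`, so the labels can simply be a copy of `input_ids`. Keras’ `fit()` still needs a label tensor to exist, though, which is what `DataCollatorForLanguageModeling(mlm=False)` provides. A rough plain-Python sketch of what that collator does per example (function name and pad id 0 are my own; -100 is the usual ignore value):

```python
def add_causal_labels(example, pad_id=0, ignore_id=-100):
    """Roughly mimic DataCollatorForLanguageModeling(mlm=False):
    copy input_ids into a labels column, mask padding positions,
    and leave the one-position shift to the model's loss computation."""
    labels = [t if t != pad_id else ignore_id for t in example["input_ids"]]
    return {**example, "labels": labels}

ex = {"input_ids": [5, 9, 2, 0, 0], "attention_mask": [1, 1, 1, 0, 0]}
out = add_causal_labels(ex)
# out["labels"] == [5, 9, 2, -100, -100]
```

So both shapes should be equivalent in the end - the question is just whether the shift happens in your data pipeline or inside the model.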

This topic is a real gap in the documentation imo, but I might just be thinking too hard about it haha. Any help would be appreciated.

NM

Causal language modeling - see the linked documentation for guidance on fine-tuning a causal language model.