How should data be structured to fine-tune a CausalLM?

Hi all,

Here’s a scenario: I want to fine-tune GPT-2 on the English subset of wikimedia/wikipedia. This dataset is, famously, a collection of blocks of text that look a bit like figure 1 below.

I’m going to be fine-tuning using TensorFlow (just because I need to use it for a downstream experiment afterward). Because GPT-2 is autoregressive, I’m assuming the data should look a bit like figure 2 (only tokenized, padded, batched, shuffled, etc.).

Is this assumption correct? There’s no documentation anywhere on this question, and my reasoning is that the output of a causal LM at each position is its predicted logits for the *next* token, so the labels should just be the input shifted one position to the left. This data structure plays nicely with sparse categorical loss functions and, importantly, is uniform/rectangular in shape.
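To make that concrete, here’s a minimal plain-Python sketch of the structure I have in mind (the helper name, the pad id of 0, and the -100 ignore value are my own assumptions, the last mirroring the common Hugging Face convention for masked-out label positions):

```python
def make_example(token_ids, max_len, pad_id=0, ignore_id=-100):
    """Turn one tokenized text into a fixed-length (input_ids, labels) pair.

    labels[i] is the token the model should predict after seeing
    input_ids[0..i], i.e. the inputs shifted one position to the left.
    """
    inputs = token_ids[:max_len]
    labels = token_ids[1:max_len + 1]  # next-token targets
    # Pad both sides so every example is rectangular.
    inputs = inputs + [pad_id] * (max_len - len(inputs))
    labels = labels + [ignore_id] * (max_len - len(labels))
    return inputs, labels

ids = [5, 9, 2, 7]  # pretend tokenizer output
x, y = make_example(ids, max_len=4)
# x = [5, 9, 2, 7]
# y = [9, 2, 7, -100]  # the final position has no next token to predict
```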

If this assumption is wrong, I can’t visualise what else could be the case. Examples of fine-tuning instruct models seem to have no labels at all, instead just fitting the model on a set of input_ids (see: https://ai.google.dev/gemma/docs/lora_tuning#load_dataset), but when I tested this with my implementation the error was something along the lines of “there was no label column detected in the dataset” - which makes sense, given all that was going in was input_ids and an attention mask.
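For what it’s worth, my understanding of why those examples get away without an explicit label column is that Hugging Face causal LM heads perform the one-position shift internally when you pass `labels`, so the labels can simply be a copy of `input_ids`. Keras’ `fit()` still needs a label tensor to exist, though, which is what `DataCollatorForLanguageModeling(mlm=False)` provides. A rough plain-Python sketch of what that collator does per example (function name and pad id 0 are my own; -100 is the usual ignore value):

```python
def add_causal_labels(example, pad_id=0, ignore_id=-100):
    """Roughly mimic DataCollatorForLanguageModeling(mlm=False):
    copy input_ids into a labels column, mask padding positions,
    and leave the one-position shift to the model's loss computation."""
    labels = [t if t != pad_id else ignore_id for t in example["input_ids"]]
    return {**example, "labels": labels}

ex = {"input_ids": [5, 9, 2, 0, 0], "attention_mask": [1, 1, 1, 0, 0]}
out = add_causal_labels(ex)
# out["labels"] == [5, 9, 2, -100, -100]
```

So both shapes should be equivalent in the end - the question is just whether the shift happens in your data pipeline or inside the model.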

This topic is a real gap in the documentation imo, but I might just be thinking too hard about it haha. Any help would be appreciated.

NM

Causal language modeling - see the linked documentation for guidance on fine-tuning a causal language model.