How to label dataset for Causal Language Modeling

tomaxe · January 27, 2023, 8:13pm

Hello, I was wondering whether labeling my dataset would lead to better results when fine-tuning a causal language model. I have seen several code examples that pass labels and others that don't.
I looked at the source code of the GPTNeoForCausalLM forward function:

if labels is not None:
    # Compute loss in fp32 to match with mesh-tf version
    # https://github.com/EleutherAI/gpt-neo/blob/89ce74164da2fb16179106f54e2269b5da8db333/models/gpt2/gpt2.py#L179
    lm_logits = lm_logits.to(torch.float32)

    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

    lm_logits = lm_logits.to(hidden_states.dtype)
    loss = loss.to(hidden_states.dtype)

So if I want to use labels, can I simply copy the input_ids as the labels? Or do I need to worry about the BOS token and similar details? Thank you.
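
Going by the forward code quoted above, the shift happens inside the model, so copying input_ids into labels looks sufficient. Here is a minimal sketch in plain Python of what that shift implies. The token ids are made up (101 stands in for a BOS token, 0 for padding), and `make_labels` and `shifted_pairs` are hypothetical helpers written for illustration, not library functions. The one extra step assumed here is masking padded positions with -100, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def make_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


def shifted_pairs(labels):
    """Mimic the in-model shift: the logits at position i are scored
    against labels[i + 1]; targets equal to IGNORE_INDEX are skipped."""
    return [
        (i, labels[i + 1])
        for i in range(len(labels) - 1)
        if labels[i + 1] != IGNORE_INDEX
    ]


# Hypothetical ids: 101 plays the role of BOS, 0 the role of padding.
input_ids = [101, 7, 8, 9, 0, 0]
attention_mask = [1, 1, 1, 1, 0, 0]

labels = make_labels(input_ids, attention_mask)
pairs = shifted_pairs(labels)

print(labels)  # [101, 7, 8, 9, -100, -100]
print(pairs)   # [(0, 7), (1, 8), (2, 9)]
```

Because the shift drops labels[0], the first token (the BOS stand-in here) is never used as a prediction target, so leaving it in the copied labels does no harm in this sketch.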
