Instruction tuning LLM

I want to fine-tune an LLM on an instruction dataset, which consists of pairs of prompts and completions. I have seen a lot of tutorials on how to fine-tune LLMs with supervised datasets. Almost all of them use Trainer or SFTTrainer from Hugging Face.

What surprised me is that there seems to be no difference between this fine-tuning and the pretraining process: in both cases, the model is trained to predict the next token over both the prompt and the completion.

Intuitively, I would prefer to compute the loss (and therefore backpropagate) only on the completion tokens, not on the prompt itself. In other words, I believe next-token prediction should only be trained from the start of the completion. Does that make sense?

Does anyone know of a library that can train the way I expect?

Hi,

That’s supported in the TRL library using the DataCollatorForCompletionOnlyLM class; see the Supervised Fine-tuning Trainer documentation.

What you want probably needs to be handled in the data preprocessing step. To the best of my knowledge, you can manually replace the prompt part of the labels with -100, a special value that is ignored in the loss calculation by the torch backend. Most third-party LLM fine-tuning repos do this, such as llama-recipes (officially supported by Llama) or ‘llamafactory’ (a very well-known LLM fine-tuning repo).
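
A minimal sketch of that masking, in plain Python without a tokenizer so it runs anywhere. The helper name `build_example` and the token ids are made up for illustration; in practice the ids would come from your tokenizer. Only the value -100 is taken from the description above.

```python
IGNORE_INDEX = -100  # value ignored by PyTorch's cross-entropy loss by default


def build_example(prompt_ids, completion_ids, ignore_index=IGNORE_INDEX):
    """Concatenate prompt and completion and mask the prompt in the labels.

    Hypothetical helper: both arguments are lists of token ids.
    """
    input_ids = list(prompt_ids) + list(completion_ids)
    # Prompt positions get ignore_index, completion positions keep their ids.
    labels = [ignore_index] * len(prompt_ids) + list(completion_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# Made-up token ids, for illustration only.
example = build_example(prompt_ids=[11, 12, 13], completion_ids=[21, 22])
print(example["input_ids"])  # [11, 12, 13, 21, 22]
print(example["labels"])     # [-100, -100, -100, 21, 22]
```

The prompt tokens stay in input_ids, so the model still conditions on them; they just do not contribute to the loss.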

Can you help me check how the training data is generated when the “text” column of the dataset is passed to the trainer() function? I mean, how can I test this with code? Thanks.

@cungnlp this can be checked by calling trainer.get_train_dataloader(). You can then inspect a batch from the dataloader:

train_dataloader = trainer.get_train_dataloader()

batch = next(iter(train_dataloader))
print(batch)
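
To see which tokens are actually trained on, it can help to compare input_ids with labels row by row. A small sketch, assuming the batch contains both keys; the helper name is hypothetical and the row below is made up to stand in for `batch["input_ids"][0].tolist()` and `batch["labels"][0].tolist()`.

```python
def supervised_token_ids(input_ids, labels, ignore_index=-100):
    """Return the token ids of one sequence whose label is not ignore_index."""
    return [tok for tok, lab in zip(input_ids, labels) if lab != ignore_index]


# Made-up row: three prompt tokens, two completion tokens, one pad token (id 0).
row_input_ids = [11, 12, 13, 21, 22, 0]
row_labels = [-100, -100, -100, 21, 22, -100]
print(supervised_token_ids(row_input_ids, row_labels))  # [21, 22]
```

Passing the returned ids to `tokenizer.decode` would show the supervised text in readable form.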

Thank you for your help!

On Wed, Jan 24, 2024 at 04:41, Niels Rogge via Hugging Face Forums <notifications@hellohellohello.discoursemail.com> wrote:

I checked, but the data doesn’t look like it has been “shifted right by one token”. Can you explain why? Also, could you share code showing how to supply my own dataset with three columns: input_ids, attention_mask, and labels?
We look forward to receiving your feedback!

On Wed, Jan 24, 2024 at 10:03, Nguyen Cung <cungmachinelearning@gmail.com> wrote:

Hi,

For LLMs in the Transformers library, the labels are typically just a copy of the input_ids (with padding tokens replaced by -100, the ignore index of the cross-entropy loss in PyTorch). The model shifts the labels by one position internally, so that the prediction at each position is scored against the next token. That is why you don’t see any shift in the dataset itself.
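
To make the internal shift concrete, here is a simplified pure-Python sketch of how I understand the loss to be computed. It is not the library's actual code: the function names, the tiny vocabulary, and the pad id 0 are all assumptions for illustration.

```python
import math

IGNORE_INDEX = -100


def shifted_targets(labels, ignore_index=IGNORE_INDEX):
    """Return (position, target) pairs that are actually scored.

    The prediction made at position i is compared with labels[i + 1];
    pairs whose target equals ignore_index are skipped.
    """
    pairs = []
    for i in range(len(labels) - 1):
        target = labels[i + 1]
        if target != ignore_index:
            pairs.append((i, target))
    return pairs


def causal_lm_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over scored positions; logits is a list of rows.

    Assumes at least one position is scored.
    """
    losses = []
    for i, target in shifted_targets(labels, ignore_index):
        row = logits[i]
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses)


# Made-up ids over a vocabulary of size 4; 0 is a hypothetical pad id.
input_ids = [1, 2, 3, 0]
labels = [1, 2, 3, -100]  # copy of input_ids with the pad position masked
print(shifted_targets(labels))  # [(0, 2), (1, 3)]

uniform_logits = [[0.0] * 4 for _ in input_ids]
print(causal_lm_loss(uniform_logits, labels))  # log(4), about 1.386
```

Note that the first label is never used as a target, since no position precedes it, and the masked pad position is dropped entirely.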

Thanks for your help, I understood the problem. Wish you an effective working day.

On Sun, May 5, 2024 at 15:50, Niels Rogge via Hugging Face Forums <notifications@hellohellohello.discoursemail.com> wrote: