I want to SFT a multi-turn dataset using LLaMA 3 8B. An example data sample is in the following format:
```json
[
  {"role": "system", "content": "..."},
  {"role": "user", "content": "..."},
  {"role": "tool", "content": "..."},
  {"role": "assistant", "content": "..."},
  {"role": "user", "content": "..."},
  {"role": "assistant", "content": "..."}
]
```
I am planning to use DataCollatorForCompletionOnlyLM to mask out the content of every role that is not assistant. The documentation recommends providing instruction and response templates for this task. However, that approach works when you have [INST] [/INST] tags to mark where an instruction starts and ends, which is not the case in the example above (LLaMA 3 uses <|start_header_id|>role<|end_header_id|> instead). So how do I use DataCollatorForCompletionOnlyLM to mask out all roles except assistant in this scenario?
Thank you!
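For reference, a minimal sketch of the setup in question, assuming the Llama 3 role headers themselves can serve as the templates (the model id below is illustrative):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Llama 3 wraps every turn in <|start_header_id|>role<|end_header_id|>,
# so the role headers can act as the templates. With both templates set,
# the collator masks everything outside the assistant turns: from the
# start of the sequence up to the first assistant header, and from each
# subsequent user header up to the next assistant header. That covers the
# system, user, and tool turns in the sample above, so only the assistant
# turns contribute to the loss.
collator = DataCollatorForCompletionOnlyLM(
    instruction_template="<|start_header_id|>user<|end_header_id|>",
    response_template="<|start_header_id|>assistant<|end_header_id|>",
    tokenizer=tokenizer,
    mlm=False,
)
```

If the plain-string match fails because a header tokenizes differently in context, the TRL docs suggest passing pre-tokenized ids instead, e.g. tokenizer.encode(template, add_special_tokens=False).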
If the appropriate configuration file (a tokenizer_config.json with a chat_template entry) is present in the model repository, apply_chat_template() should work fine. If not, the fallback will probably be the ChatML equivalent.
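A quick way to see what that template produces (the messages here are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]

# Renders the conversation with the template shipped in the repo's
# tokenizer_config.json; for Llama 3 Instruct this produces the
# <|start_header_id|>role<|end_header_id|> framing discussed above.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```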
Hello! I’m implementing a framework for fine-tuning various LLMs using the TRL library’s SFTTrainer. I have a question about how chat templates work:
When using SFTTrainer with datasets in the standard formats (with “messages” array or “prompt”/“completion” fields), does the trainer automatically apply the tokenizer’s chat_template? The documentation suggests it does.
For models whose tokenizers don’t have a chat_template attribute set (or it’s empty), what template does SFTTrainer apply by def…
Hi! I am interested in using the `SFTTrainer` for instruction-tuning. Following [the docs](https://huggingface.co/docs/trl/main/en/sft_trainer#dataset-format-support), I can see that I can provide examples in the following format to have the trainer format things for me:
```json
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```
The docs also say:
> The [SFTTrainer](https://huggingface.co/docs/trl/main/en/trainer#trl.SFTTrainer) will then format the dataset for you using the defined format from the model’s tokenizer with the [apply_chat_template](https://huggingface.co/docs/transformers/main/en/chat_templating#templates-for-chat-models) method.
My question and confusion is, what does the trainer do if the tokenizer has no `chat_template`, as is the case with the [base llama model](https://huggingface.co/meta-llama/Llama-2-13b-hf/blob/main/tokenizer_config.json)?
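One way to check this for a given checkpoint, sketched with the base repo linked above (what happens on a missing template varies by transformers version, from a ChatML-style default to an error):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

# Base (non-chat) checkpoints typically ship no template, so this prints
# None and apply_chat_template() has no model-specific format to use.
print(tok.chat_template)
```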
DataCollatorForCompletionOnlyLM
```python
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from transformers import AutoTokenizer
from datasets import load_dataset
# Load Dataset and tokenizer
dataset = load_dataset('prince-canuma/tinyOrca', split='train')
tokenizer = AutoTokenizer.from_pretrained("prince-canuma/Damysus-2.7B-Chat")
```
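The embed cuts the file off there. A plausible continuation, sketched under assumptions: the ChatML-style response marker is a guess for this model, the dataset column name is assumed, and some SFTTrainer keyword names vary across TRL versions.

```python
# Hypothetical continuation of the truncated file above.
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant",  # assumed ChatML-style marker
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model="prince-canuma/Damysus-2.7B-Chat",
    train_dataset=dataset,
    tokenizer=tokenizer,        # renamed `processing_class` in newer TRL
    dataset_text_field="text",  # assumed column with rendered chat text
    data_collator=collator,
)
trainer.train()
```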