I want to SFT a multi-turn dataset using LLaMA 3 8B. An example data sample is in the following format:
```json
[
  {"role": "system", "content": "..."},
  {"role": "user", "content": "..."},
  {"role": "tool", "content": "..."},
  {"role": "assistant", "content": "..."},
  {"role": "user", "content": "..."},
  {"role": "assistant", "content": "..."}
]
```
I am planning to use DataCollatorForCompletionOnlyLM to mask out the content of every role that is not assistant. The documentation recommends providing instruction and response templates for this task. However, that approach works when you have [INST] [/INST] tags to mark where an instruction starts and ends, which is not the case in the example above (LLaMA 3 uses <|start_header_id|>role<|end_header_id|> instead). So how do I use DataCollatorForCompletionOnlyLM to mask out all roles except assistant in this scenario?
Thank you!
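For reference, a minimal sketch of the setup in question, assuming the Llama 3 role headers themselves can serve as the templates (the model id below is illustrative):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Llama 3 wraps every turn in <|start_header_id|>role<|end_header_id|>,
# so the role headers can act as the templates. With both templates set,
# the collator masks everything outside the assistant turns: from the
# start of the sequence up to the first assistant header, and from each
# subsequent user header up to the next assistant header. That covers the
# system, user, and tool turns in the sample above, so only the assistant
# turns contribute to the loss.
collator = DataCollatorForCompletionOnlyLM(
    instruction_template="<|start_header_id|>user<|end_header_id|>",
    response_template="<|start_header_id|>assistant<|end_header_id|>",
    tokenizer=tokenizer,
    mlm=False,
)
```

If the plain-string match fails because a header tokenizes differently in context, the TRL docs suggest passing pre-tokenized ids instead, e.g. tokenizer.encode(template, add_special_tokens=False).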
If the appropriate configuration file (a tokenizer_config.json with a chat_template entry) is present in the model repository, apply_chat_template() should work fine. If not, the fallback will probably be the ChatML equivalent.
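A quick way to see what that template produces (the messages here are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]

# Renders the conversation with the template shipped in the repo's
# tokenizer_config.json; for Llama 3 Instruct this produces the
# <|start_header_id|>role<|end_header_id|> framing discussed above.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```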
Hello! I’m implementing a framework for fine-tuning various LLMs using the TRL library’s SFTTrainer. I have a question about how chat templates work:
When using SFTTrainer with datasets in the standard formats (with “messages” array or “prompt”/“completion” fields), does the trainer automatically apply the tokenizer’s chat_template? The documentation suggests it does.
For models whose tokenizers don’t have a chat_template attribute set (or it’s empty), what template does SFTTrainer apply by def…
Hi! I am interested in using the `SFTTrainer` for instruction-tuning. Following [the docs](https://huggingface.co/docs/trl/main/en/sft_trainer#dataset-format-support), I can see that I can provide examples in the following format to have the trainer format things for me:
```json
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```
The docs also say:
> The [SFTTrainer](https://huggingface.co/docs/trl/main/en/trainer#trl.SFTTrainer) will then format the dataset for you using the defined format from the model’s tokenizer with the [apply_chat_template](https://huggingface.co/docs/transformers/main/en/chat_templating#templates-for-chat-models) method.
My question and confusion is, what does the trainer do if the tokenizer has no `chat_template`, as is the case with the [base llama model](https://huggingface.co/meta-llama/Llama-2-13b-hf/blob/main/tokenizer_config.json)?
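One way to check this for a given checkpoint, sketched with the base repo linked above (what happens on a missing template varies by transformers version, from a ChatML-style default to an error):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

# Base (non-chat) checkpoints typically ship no template, so this prints
# None and apply_chat_template() has no model-specific format to use.
print(tok.chat_template)
```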
DataCollatorForCompletionOnlyLM
```python
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from transformers import AutoTokenizer
from datasets import load_dataset
# Load Dataset and tokenizer
dataset = load_dataset('prince-canuma/tinyOrca', split='train')
tokenizer = AutoTokenizer.from_pretrained("prince-canuma/Damysus-2.7B-Chat")
```
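The embed cuts the file off there. A plausible continuation, sketched under assumptions: the ChatML-style response marker is a guess for this model, the dataset column name is assumed, and some SFTTrainer keyword names vary across TRL versions.

```python
# Hypothetical continuation of the truncated file above.
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant",  # assumed ChatML-style marker
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model="prince-canuma/Damysus-2.7B-Chat",
    train_dataset=dataset,
    tokenizer=tokenizer,        # renamed `processing_class` in newer TRL
    dataset_text_field="text",  # assumed column with rendered chat text
    data_collator=collator,
)
trainer.train()
```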