Dataset format standards for chat-based, fine-tuned Llama models

I want to use a Llama-based model for a text-generation/chat-bot application. I have my own data, and I was curious how to format it so as to get the best results out of my fine-tuning. Currently, I use "[PCP] " and "[SR] " to separate who is talking. Here's a snippet of my code that does this, followed by an example of two conversations I might have in my dataset.

    def build_conversation(group):
        """Concatenate one conversation's rows into a single tagged string.

        group: a DataFrame holding one conversation's rows, already in order.
        textFromHtml() is a helper (defined elsewhere) that strips HTML markup.
        """
        conversation = ""
        for _, row in group.iterrows():
            if row.get("PCP_MESSAGE", "").strip():
                clean_pcp_message = textFromHtml(row["PCP_MESSAGE"])
                conversation += "[PCP] " + clean_pcp_message + " "
            if row.get("SR_MESSAGE", "").strip():
                clean_sr_message = textFromHtml(row["SR_MESSAGE"])
                conversation += "[SR] " + clean_sr_message + " "

        print(conversation)

        return conversation
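As an aside, if you end up using a tokenizer chat template later (see below), it can help to build a `messages`-style list instead of one tagged string. Here is a minimal sketch under some assumptions: rows are plain dicts with the same `PCP_MESSAGE`/`SR_MESSAGE` keys, the HTML cleaning step is elided, and mapping PCP to `"user"` and SR to `"assistant"` is a guess about which speaker the model should imitate.

    # Hypothetical sketch: convert PCP/SR rows into role/content messages.
    # Assumes rows are dicts with the same keys as the DataFrame columns above;
    # the PCP -> "user" / SR -> "assistant" mapping is an assumption.
    def rows_to_messages(rows):
        messages = []
        for row in rows:
            if row.get("PCP_MESSAGE", "").strip():
                messages.append({"role": "user", "content": row["PCP_MESSAGE"].strip()})
            if row.get("SR_MESSAGE", "").strip():
                messages.append({"role": "assistant", "content": row["SR_MESSAGE"].strip()})
        return messages

    rows = [
        {"PCP_MESSAGE": "Some text", "SR_MESSAGE": "Dear Lorem, ipsum dolor sit amet."},
        {"PCP_MESSAGE": "Hello Lorem", "SR_MESSAGE": ""},
    ]
    print(rows_to_messages(rows))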

[PCP] Some text some text some text [SR] Dear Lorem, ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est

laborum. [PCP] Hello Lorem,

ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco. [SR] Hi __,

&

Good morning,

Thanks for getting back to GI.

Given the clinical symptom along with patient’s age and documented BRBPR, GI will evaluate patient in GI clinic to consider scheduling diagnostic EGD/colonoscopy evaluation.
&
Best [PCP] Great!

Thank you.

and (no data is real)…

[PCP] 45 yo male already on Pantoprazole , has recurrence , worsening GERD with esophagitis.

addendum: update:01/01/1970. patient tested negative for H.pylori in the stool but still having a lot of abdominal gas and eructation [SR] Dr Smith

what are the pts GERD and "esophagitis" symptoms.

MMK

This is the only place I've seen that has a little documentation on how to format data depending on which model and which task you are using. A question was asked there but never answered.

It says to use this format for chat-based models:

    [INST]<<SYS>>
    You are a friendly chatbot that gives helpful answers
    <</SYS>>

    Hello[/INST]Hello, how are you?</s><s>[INST]Good, please tell me what 1+1 is.[/INST]1+1=2. Please let me know if you need anything else!</s>
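If you want to assemble that layout yourself, here is a rough sketch of a helper that stitches turns together. `build_llama2_prompt` is a hypothetical name, and the exact whitespace around `[INST]`/`[/INST]` varies between write-ups, so verify against the template your particular checkpoint was trained with.

    # Sketch of assembling the Llama 2 chat layout shown above by hand.
    # The precise spacing is an assumption; check your checkpoint's template.
    def build_llama2_prompt(system, turns):
        """turns: list of (user_message, assistant_reply) pairs."""
        prompt = ""
        for i, (user, assistant) in enumerate(turns):
            if i == 0:
                # The system prompt is folded into the first user turn.
                user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
            prompt += f"<s>[INST] {user} [/INST] {assistant} </s>"
        return prompt

    print(build_llama2_prompt(
        "You are a friendly chatbot that gives helpful answers",
        [("Hello", "Hello, how are you?")],
    ))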

But I am curious if there is any other documentation out there that details how and why data should be formatted a certain way for a certain type of model and task.

Maybe ChatML, as proposed in "How to Fine-Tune LLMs in 2024 with Hugging Face"?
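For reference, ChatML wraps each message in `<|im_start|>role` / `<|im_end|>` markers. A minimal illustrative sketch (not a drop-in replacement for a tokenizer's real chat template):

    # Minimal sketch of the ChatML layout: <|im_start|>role\ncontent<|im_end|>.
    # Purely illustrative; in practice the tokenizer's chat template does this.
    def to_chatml(messages):
        out = ""
        for m in messages:
            out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        return out

    print(to_chatml([
        {"role": "system", "content": "You are a friendly chatbot"},
        {"role": "user", "content": "Hello"},
    ]))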

I FOUND IT

Utilities for Tokenizers (huggingface.co)

Here is an example from Templates for Chat Models (huggingface.co):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "HuggingFaceH4/zephyr-7b-beta"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)  # You may want to use bfloat16 and/or move to GPU here

    messages = [
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ]
    tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
    print(tokenizer.decode(tokenized_chat[0]))

Set add_generation_prompt=True when running inference, and set it to False when preparing data for training/fine-tuning.
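To make the difference concrete, here is a toy illustration using ChatML-style markers as a stand-in for whatever template your tokenizer actually defines (`render` is a hypothetical helper): at inference you append the empty assistant header so the model starts generating the reply, while for training you leave only the completed turns.

    # Toy illustration of what add_generation_prompt changes.
    # ChatML-style markers are a stand-in for your tokenizer's real template.
    def render(messages, add_generation_prompt):
        text = "".join(
            f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
        )
        if add_generation_prompt:
            # Inference: open an assistant turn for the model to complete.
            text += "<|im_start|>assistant\n"
        return text

    msgs = [{"role": "user", "content": "Hello"}]
    print(render(msgs, add_generation_prompt=True))
    print(render(msgs, add_generation_prompt=False))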

This will automatically format your messages into the chat format that model was trained with :)