Dataset format standards for chat-based, fine-tuned Llama models

I want to use a Llama-based model for a text-generation/chat-bot application. I have my own data, and I was curious how to format it so as to get the best results out of my fine-tuning. Currently, I use "[PCP] " and "[SR] " to separate who is talking. Here's a snippet of my code that does this, followed by an example of two conversations I might have in my dataset.

    def build_conversation(group):
        """Concatenate one conversation's rows into a single tagged string.

        group: a DataFrame holding one conversation's rows, already in order.
        textFromHtml() is a helper (defined elsewhere) that strips HTML markup.
        """
        conversation = ""
        for _, row in group.iterrows():
            if row.get("PCP_MESSAGE", "").strip():
                clean_pcp_message = textFromHtml(row["PCP_MESSAGE"])
                conversation += "[PCP] " + clean_pcp_message + " "
            if row.get("SR_MESSAGE", "").strip():
                clean_sr_message = textFromHtml(row["SR_MESSAGE"])
                conversation += "[SR] " + clean_sr_message + " "

        print(conversation)

        return conversation
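As an aside, if you end up using a tokenizer chat template later (see below), it can help to build a `messages`-style list instead of one tagged string. Here is a minimal sketch under some assumptions: rows are plain dicts with the same `PCP_MESSAGE`/`SR_MESSAGE` keys, the HTML cleaning step is elided, and mapping PCP to `"user"` and SR to `"assistant"` is a guess about which speaker the model should imitate.

    # Hypothetical sketch: convert PCP/SR rows into role/content messages.
    # Assumes rows are dicts with the same keys as the DataFrame columns above;
    # the PCP -> "user" / SR -> "assistant" mapping is an assumption.
    def rows_to_messages(rows):
        messages = []
        for row in rows:
            if row.get("PCP_MESSAGE", "").strip():
                messages.append({"role": "user", "content": row["PCP_MESSAGE"].strip()})
            if row.get("SR_MESSAGE", "").strip():
                messages.append({"role": "assistant", "content": row["SR_MESSAGE"].strip()})
        return messages

    rows = [
        {"PCP_MESSAGE": "Some text", "SR_MESSAGE": "Dear Lorem, ipsum dolor sit amet."},
        {"PCP_MESSAGE": "Hello Lorem", "SR_MESSAGE": ""},
    ]
    print(rows_to_messages(rows))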

[PCP] Some text some text some text [SR] Dear Lorem, ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est

laborum. [PCP] Hello Lorem,

ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco. [SR] Hi __,

&

Good morning,

Thanks for getting back to GI.

Given the clinical symptom along with patient’s age and documented BRBPR, GI will evaluate patient in GI clinic to consider scheduling diagnostic EGD/colonoscopy evaluation.
&
Best [PCP] Great!

Thank you.

and (no data is real)…

[PCP] 45 yo male already on Pantoprazole , has recurrence , worsening GERD with esophagitis.

addendum: update:01/01/1970. patient tested negative for H.pylori in the stool but still having a lot of abdominal gas and eructation [SR] Dr Smith

what are the pts GERD and "esophagitis" symptoms.

MMK

This is the only place I've seen that has a little documentation on how to format data depending on which model and which task you are using. A question was asked there but never answered.

It says to use this format for chat-based models:

    [INST]<<SYS>>
    You are a friendly chatbot that gives helpful answers
    <</SYS>>

    Hello[/INST]Hello, how are you?</s><s>[INST]Good, please tell me what 1+1 is.[/INST]1+1=2. Please let me know if you need anything else!</s>
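If you want to assemble that layout yourself, here is a rough sketch of a helper that stitches turns together. `build_llama2_prompt` is a hypothetical name, and the exact whitespace around `[INST]`/`[/INST]` varies between write-ups, so verify against the template your particular checkpoint was trained with.

    # Sketch of assembling the Llama 2 chat layout shown above by hand.
    # The precise spacing is an assumption; check your checkpoint's template.
    def build_llama2_prompt(system, turns):
        """turns: list of (user_message, assistant_reply) pairs."""
        prompt = ""
        for i, (user, assistant) in enumerate(turns):
            if i == 0:
                # The system prompt is folded into the first user turn.
                user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
            prompt += f"<s>[INST] {user} [/INST] {assistant} </s>"
        return prompt

    print(build_llama2_prompt(
        "You are a friendly chatbot that gives helpful answers",
        [("Hello", "Hello, how are you?")],
    ))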

But I am curious if there is any other documentation out there that details how and why data should be formatted a certain way for a certain type of model and task.

Maybe ChatML, as proposed in "How to Fine-Tune LLMs in 2024 with Hugging Face"?
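For reference, ChatML wraps each message in `<|im_start|>role` / `<|im_end|>` markers. A minimal illustrative sketch (not a drop-in replacement for a tokenizer's real chat template):

    # Minimal sketch of the ChatML layout: <|im_start|>role\ncontent<|im_end|>.
    # Purely illustrative; in practice the tokenizer's chat template does this.
    def to_chatml(messages):
        out = ""
        for m in messages:
            out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        return out

    print(to_chatml([
        {"role": "system", "content": "You are a friendly chatbot"},
        {"role": "user", "content": "Hello"},
    ]))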

I FOUND IT

Utilities for Tokenizers (huggingface.co)

Here is an example from Templates for Chat Models (huggingface.co):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "HuggingFaceH4/zephyr-7b-beta"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)  # You may want to use bfloat16 and/or move to GPU here

    messages = [
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ]
    tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
    print(tokenizer.decode(tokenized_chat[0]))

Set add_generation_prompt=True when running inference, and set it to False when preparing data for training/fine-tuning.
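To make the difference concrete, here is a toy illustration using ChatML-style markers as a stand-in for whatever template your tokenizer actually defines (`render` is a hypothetical helper): at inference you append the empty assistant header so the model starts generating the reply, while for training you leave only the completed turns.

    # Toy illustration of what add_generation_prompt changes.
    # ChatML-style markers are a stand-in for your tokenizer's real template.
    def render(messages, add_generation_prompt):
        text = "".join(
            f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
        )
        if add_generation_prompt:
            # Inference: open an assistant turn for the model to complete.
            text += "<|im_start|>assistant\n"
        return text

    msgs = [{"role": "user", "content": "Hello"}]
    print(render(msgs, add_generation_prompt=True))
    print(render(msgs, add_generation_prompt=False))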

This will automatically format your messages into the chat format that model was trained with :)