I want to fine-tune a Llama-based model for text generation (a chatbot). I have my own data and am curious how to format it to get the best results from fine-tuning. Currently, I use "[PCP] " and "[SR] " to mark who is talking. Here’s a snippet of my code that does this, followed by an example of two conversations I might have in my dataset.
conversation = ""
for _, row in group.iterrows():
    if row.get("PCP_MESSAGE", "").strip():
        clean_pcp_message = textFromHtml(row["PCP_MESSAGE"])
        conversation += "[PCP] " + clean_pcp_message + " "
    if row.get("SR_MESSAGE", "").strip():
        clean_sr_message = textFromHtml(row["SR_MESSAGE"])
        conversation += "[SR] " + clean_sr_message + " "
print(conversation)
return conversation
[PCP] Some text some text some text [SR] Dear Lorem, ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est
laborum. [PCP] Hello Lorem,
ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco. [SR] Hi __,
&
Good morning,
Thanks for getting back to GI.
Given the clinical symptom along with patient’s age and documented BRBPR, GI will evaluate patient in GI clinic to consider scheduling diagnostic EGD/colonoscopy evaluation.
&
Best [PCP] Great!Thank you.
and (no data is real)…
[PCP] 45 yo male already on Pantoprazole , has recurrence , worsening GERD with esophagitis.
addendum: update:01/01/1970. patient tested negative for H.pylori in the stool but still having a lot of abdominal gas and eructation [SR] Dr Smith
what are the pts GERD and “esophagitis” symptoms.
MMK
This is the only place I’ve seen that has even a little documentation on how to format data depending on which model and which task you are using. A question was asked here but never answered.
It says to use this format for chat-based models:
[INST]<<SYS>>
You are a friendly chatbot that gives helpful answers
<</SYS>>Hello[/INST]Hello, how are you?</s><s>[INST]Good, please tell me what 1+1 is.[/INST]1+1=2. Please let me know if you need anything else!</s>
But I am curious if there is any other documentation out there that details how and why data should be formatted a certain way for a certain type of model and task.
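To make the question concrete, here is a minimal sketch of how the [PCP]/[SR] turns might be rendered into the chat template quoted above. It rests on several assumptions that are not confirmed by any documentation: that [PCP] maps to the user role and [SR] to the assistant role, that consecutive messages from the same speaker can be merged, that a turn without a matching reply can be dropped, and that the tag placement and whitespace should follow the quoted snippet exactly. The names `ROLE_OF`, `merge_turns`, and `to_llama2_chat` are made up for illustration.

```python
from typing import List, Tuple

# Assumption: the PCP asks (user role) and the SR answers (assistant role).
ROLE_OF = {"PCP": "user", "SR": "assistant"}


def merge_turns(turns: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map speakers to roles, collapse whitespace, and merge
    consecutive messages that come from the same role."""
    merged: List[Tuple[str, str]] = []
    for speaker, text in turns:
        role = ROLE_OF[speaker]
        text = " ".join(text.split())  # collapse newlines and repeated spaces
        if not text:
            continue
        if merged and merged[-1][0] == role:
            merged[-1] = (role, merged[-1][1] + " " + text)
        else:
            merged.append((role, text))
    return merged


def to_llama2_chat(turns: List[Tuple[str, str]], system_prompt: str) -> str:
    """Render (speaker, text) turns in the layout of the quoted snippet.

    Only complete user -> assistant pairs are kept; a leading assistant
    message or a trailing unanswered user message is dropped.
    """
    merged = merge_turns(turns)
    pairs: List[Tuple[str, str]] = []
    i = 0
    while i + 1 < len(merged):
        if merged[i][0] == "user" and merged[i + 1][0] == "assistant":
            pairs.append((merged[i][1], merged[i + 1][1]))
            i += 2
        else:
            i += 1

    segments = []
    for idx, (user, assistant) in enumerate(pairs):
        # The system block appears only inside the first [INST] ... [/INST].
        sys_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>" if idx == 0 else ""
        segments.append(f"[INST]{sys_block}{user}[/INST]{assistant}</s>")
    # Later exchanges are introduced with <s>, as in the quoted snippet.
    return "<s>".join(segments)


example = [
    ("PCP", "Hello"),
    ("SR", "Hello, how are you?"),
    ("PCP", "Good, please tell me what 1+1 is."),
    ("SR", "1+1=2. Please let me know if you need anything else!"),
]
print(to_llama2_chat(example, "You are a friendly chatbot that gives helpful answers"))
```

Given the two exchanges from the quoted snippet as input, this reproduces that snippet's string. Whether the tokenizer adds the leading `<s>` itself, and whether extra spaces around `[INST]` matter, is exactly the kind of detail I cannot find documented.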