SFT for chatbot - `text` column

Hi all

According to the Hugging Face documentation, this is the format for SFT: human: ... bot: ...

However, according to the example dataset, the string in the `text` column should be in the following format:
### Human: ... ### Assistant: ...

When opening a new AutoTrain project, the following column mapping is requested: {"text": "text"}

So, my questions:

  1. Should it be ### Human: ... ### Assistant: ... or human: ... bot: ...? (human or Human, Assistant or bot, is it case-sensitive, with or without ###?)
  2. I also saw a chaining of human-assistant pairs in the example dataset. In what case would I use this instead of just one pair?
  3. What do I do if I want to add context? Basically, I want to fine-tune a model on conversations in order to create a custom chatbot. Therefore, I need to input not only a single question and answer but also the whole conversation that preceded a specific question, as context.
  4. Is there a way to provide a chain-of-thought input in the form of JSON? For example: {"conversation_context": conversation_context, "customer_question": customer_question, "assistant_response": assistant_response}?

Many thanks :pray:t3:

Hello @nadav-n,

1- Format to follow

As I understand it, the documentation only says that the data should be formatted as a conversation between the human and the agent. You don’t have to follow any specific format as long as your data is consistent across samples.

In that regard, both human: ... bot: ... and ### Human: ... ### Assistant: ... work fine :slight_smile:
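As a minimal sketch of what "consistent across samples" could look like, the snippet below renders every row of the `text` column from one shared template and writes a CSV with a single `text` column. The ### Human: / ### Assistant: markers follow the example dataset; the helper name `format_pair` and the sample rows are made up for illustration, and nothing here is an AutoTrain API.

```python
import csv
import io

# One template for every sample, so the markers never vary between rows.
TEMPLATE = "### Human: {question} ### Assistant: {answer}"

def format_pair(question: str, answer: str) -> str:
    """Render one question/answer pair with the shared template (hypothetical helper)."""
    return TEMPLATE.format(question=question.strip(), answer=answer.strip())

# Made-up sample data, for illustration only.
pairs = [
    ("What is your return policy?", "You can return items within 30 days."),
    ("Do you ship abroad?", "Yes, we ship to most countries."),
]

rows = [{"text": format_pair(q, a)} for q, a in pairs]

# Write a CSV with a single `text` column, matching the {"text": "text"} mapping.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Swapping `TEMPLATE` for a human: ... bot: ... variant changes every row at once, which is the point: whichever markers you pick, they stay identical across the dataset.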

2/3- Chaining and adding context

(I’m not an expert on AutoTrain with SFT, so please double-check my statements.)

I can see several reasons for having multiple human-assistant pairs within a single sample. One is simply to give the model more context in order to influence the expected output.

This should also answer your third question: to add more context about the conversation, I would simply provide the chat history to the model in the same format (i.e. ### Human: ... ### Assistant: ...). Or, if you just want to provide some general context, I guess you could prepend something like ### Context: ... to your sample.
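Here is a rough sketch of that idea, assuming each record is a dict with the keys from question 4 (`conversation_context`, `customer_question`, `assistant_response`) and that `conversation_context` is a list of (human, assistant) turn pairs. That record shape, the helper name `build_text`, and the sample data are assumptions for illustration; the ### Context: marker is only the suggestion above, not a documented convention.

```python
from typing import List, Optional, Tuple

def build_text(
    history: List[Tuple[str, str]],
    question: str,
    response: str,
    context: Optional[str] = None,
) -> str:
    """Flatten chat history plus the final exchange into one `text` string (hypothetical helper)."""
    parts = []
    if context:
        # Optional free-form context, prepended before the conversation.
        parts.append(f"### Context: {context}")
    for human_turn, assistant_turn in history:
        # Earlier turns use the same markers as the final pair.
        parts.append(f"### Human: {human_turn}")
        parts.append(f"### Assistant: {assistant_turn}")
    parts.append(f"### Human: {question}")
    parts.append(f"### Assistant: {response}")
    return " ".join(parts)

# Made-up record using the key names from question 4.
record = {
    "conversation_context": [("Hi, I ordered a lamp.", "Thanks! How can I help?")],
    "customer_question": "Can I change the delivery address?",
    "assistant_response": "Yes, until the order ships.",
}

text = build_text(
    record["conversation_context"],
    record["customer_question"],
    record["assistant_response"],
    context="Customer support chat",
)
print(text)
```

With an empty history and no context, the same function produces a plain single-pair sample, so single-turn and multi-turn rows can share one format.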

Hope that helped :slight_smile:

Here is an example dataset: alpaca1k.csv in the huggingface/autotrain-example-datasets repository on GitHub.