How to fine-tune an LLM with AutoTrain?

Hi,
This document indicates that one can fine-tune an LLM with AutoTrain using CLM.
I have a dataset that is formatted as:
{"instruction": "xxx", "input": "yyy", "output": "zzz"} tuples.

When attempting to create a new AutoTrain project, I’m not sure which option to choose to be able to train the model with these tuples using CLM.

Any suggestions would be helpful. Thank you!

You seem to have a JSONL file. You can convert it to CSV using Python:

import pandas as pd

# read the JSONL file into a pandas DataFrame
df = pd.read_json('input.jsonl', lines=True)

# write the DataFrame to a CSV file
df.to_csv('output.csv', index=False)
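If the trainer expects a single `text` column rather than three separate ones, the fields could be merged before writing the CSV. The following is only a minimal sketch using the standard library: the `### Human:` / `### Assistant:` markers are an assumption taken from the example dataset mentioned in this thread, not a confirmed AutoTrain requirement, and `build_text` and `jsonl_to_text_csv` are hypothetical helper names.

```python
import csv
import json


def build_text(record: dict) -> str:
    """Merge instruction, optional input, and output into one training string.

    The "### Human:" / "### Assistant:" markers are an assumed template.
    """
    prompt = record["instruction"]
    # Append the optional "input" field only when it is present and non-empty.
    extra = (record.get("input") or "").strip()
    if extra:
        prompt = f"{prompt}\n{extra}"
    return f"### Human: {prompt} ### Assistant: {record['output']}"


def jsonl_to_text_csv(src_path: str, dst_path: str) -> int:
    """Write a one-column CSV (header: text) and return the number of rows."""
    rows = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["text"])
        for line in src:
            if not line.strip():
                continue  # skip blank lines in the JSONL file
            writer.writerow([build_text(json.loads(line))])
            rows += 1
    return rows
```

Calling `jsonl_to_text_csv("input.jsonl", "train.csv")` would then produce a file whose only column is `text`, which matches a `{"text": "text"}` column mapping.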

Regarding column mapping, an almost identical example seems to be provided:

Sorry, I should have pasted this picture.
Which of the model types map to training the model with CLM?

The LLM task is only available in AutoTrain Advanced: AutoTrain

Are there specific resource requirements for the Hugging Face Space environment in order to proceed with LLM fine-tuning?


Hi @abhishek ,
I have a follow-up question please.

According to the Hugging Face documentation, this is the format for SFT: human: ... bot: ...

However, according to the example dataset the string in the text column should be in the following format:
### Human: ... ### Assistant: ...

When opening a new AutoTrain project, the following column mapping is requested: {"text": "text"}

So, my questions:

  1. Should it be ### Human: ... ### Assistant: ... or human: ... bot: ...? (Is it human or Human, Assistant or bot? Is it case-sensitive, and with or without ###?)
  2. I also saw a chaining of human-assistant pairs in the example dataset. In what case would I use this instead of just one pair?
  3. What do I do if I want to add context? Basically, I want to fine-tune a model in order to create a custom chatbot that is fine-tuned on conversations. Therefore, I need to input not only a single question and answer but also the whole conversation that preceded a specific question as context.
  4. Is there a way to input a chain-of-thought input in the form of a JSON? For example: {"conversation_context": conversation_context, "customer_question": customer_question, "assistant_response": assistant_response}?

Many thanks
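For reference, one possible way to carry conversation context (questions 2 and 3) is to chain the preceding turns into the same `text` string, producing one training example per assistant reply. This is a sketch under stated assumptions, not a confirmed AutoTrain convention: the `### Human:` / `### Assistant:` markers follow the example dataset mentioned above, and the `role`/`content` schema and both function names are hypothetical.

```python
# Assumed role-to-marker mapping; the exact markers are not confirmed by AutoTrain.
LABELS = {"human": "### Human:", "assistant": "### Assistant:"}


def render_conversation(turns: list) -> str:
    """Join turns such as {"role": "human", "content": "..."} into one string."""
    return " ".join(f"{LABELS[t['role']]} {t['content']}" for t in turns)


def examples_from_conversation(turns: list) -> list:
    """Return one training string per assistant reply.

    Each string contains every turn up to and including that reply, so the
    earlier part of the conversation acts as context.
    """
    examples = []
    for i, turn in enumerate(turns):
        if turn["role"] == "assistant":
            examples.append(render_conversation(turns[: i + 1]))
    return examples
```

With this layout, a single human-assistant pair is simply the special case of a conversation with two turns, and each resulting string would go into the `text` column.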