Hugging Face Forums

How to fine-tune an LLM with AutoTrain?

jasonshahmf May 8, 2023, 10:34pm 1

Hi,
This document indicates that one can fine-tune an LLM with AutoTrain using CLM.
I have a dataset that is formatted as:
{"instruction": "xxx", "input": "yyy", "output": "zzz"} tuples.

When attempting to create a new AutoTrain project, I’m not sure which option to choose to be able to train the model with these tuples using CLM.

Any suggestions would be helpful. Thank you!

abhishek May 9, 2023, 4:28am 2

You seem to have a JSONL file; you can convert it to CSV using Python:

import pandas as pd

# read the JSONL file into a pandas DataFrame
df = pd.read_json('input.jsonl', lines=True)

# write the DataFrame to a CSV file
df.to_csv('output.csv', index=False)
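If the training data needs to end up in a single text column, the three fields could be merged before the CSV is written. The following is only a minimal sketch using the standard library: the `### Instruction:` / `### Input:` / `### Response:` template, the `text` column name, and the helper names `build_text` and `jsonl_to_text_csv` are all assumptions for illustration, not something this thread confirms AutoTrain requires.

```python
import csv
import io
import json


def build_text(record: dict) -> str:
    """Merge one instruction/input/output record into a single string.

    The section labels are an assumed prompt template, not an AutoTrain rule.
    """
    parts = [f"### Instruction:\n{record['instruction']}"]
    # Skip the input section when the record has no extra input.
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)


def jsonl_to_text_csv(jsonl_lines, out_file) -> int:
    """Write a one-column CSV (header: text) and return the number of rows."""
    writer = csv.writer(out_file)
    writer.writerow(["text"])
    count = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the JSONL input
        writer.writerow([build_text(json.loads(line))])
        count += 1
    return count


# Small in-memory demonstration using the placeholder record from the question.
sample = ['{"instruction": "xxx", "input": "yyy", "output": "zzz"}']
buffer = io.StringIO()
rows_written = jsonl_to_text_csv(sample, buffer)
```

To work on real files, the same function can be called with open file handles, for example `open('input.jsonl')` for reading and `open('output.csv', 'w', newline='')` for writing.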

Regarding column mapping, almost the same case is provided as an example:

jasonshahmf May 9, 2023, 4:25pm 3

Sorry, I should have pasted this picture.
Which of the model types map to training the model with CLM?

abhishek May 9, 2023, 4:37pm 4

LLM is only available in AutoTrain Advanced: AutoTrain

Storm-AIM June 18, 2023, 9:42am 5

Are there specific resource requirements for the hugging face space environment in order to proceed with LLM Finetuning?


nadav-n March 3, 2024, 5:24pm 6

Hi @abhishek ,
I have a follow-up question please.

According to the Hugging Face documentation, this is the format for SFT: human: ... bot: ...

However, according to the example dataset the string in the text column should be in the following format:
### Human: ... ### Assistant: ...

When opening a new AutoTrain project, the following column mapping is requested: {"text": "text"}

So, my questions:

Should it be ### Human: ... ### Assistant: ... or human: ... bot: ...? (human or Human, Assistant or bot, case-sensitive, with or without ###?)
I also saw a chain of human-assistant pairs in the example dataset. In what case would I use this instead of a single pair?
What do I do if I want to add context? Basically, I want to fine-tune a model to create a custom chatbot trained on conversations. Therefore, I need to input not only a single question and answer but also the whole conversation that preceded a specific question, as context.
Is there a way to provide a chain-of-thought input in the form of a JSON? For example: {"conversation_context": conversation_context, "customer_question": customer_question, "assistant_response": assistant_response}?

Many thanks
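For readers with the same context question, one possible approach is to flatten the preceding turns plus the final question and answer into a single string for the `text` column. The sketch below assumes the `### Human: ... ### Assistant: ...` layout seen in the example dataset mentioned above; it does not settle which of the two formats AutoTrain actually expects, and the helper names `format_conversation` and `build_example` are hypothetical.

```python
def format_conversation(turns):
    """Join (role, message) pairs into one string.

    Roles are "human" or "assistant"; the "### Human:" / "### Assistant:"
    labels are assumed from the example dataset, not a confirmed requirement.
    """
    labels = {"human": "### Human:", "assistant": "### Assistant:"}
    return " ".join(f"{labels[role]} {message.strip()}" for role, message in turns)


def build_example(conversation_context, customer_question, assistant_response):
    """Turn prior turns plus the final question/answer into a {"text": ...} row."""
    turns = list(conversation_context) + [
        ("human", customer_question),
        ("assistant", assistant_response),
    ]
    return {"text": format_conversation(turns)}


# Two earlier turns serve as context for the final question and answer.
example = build_example(
    [("human", "Hi"), ("assistant", "Hello")],
    "Q?",
    "A.",
)
```

Each resulting dictionary is one row of the dataset, so the JSON fields from the question are consumed at preparation time rather than passed to training as separate columns.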
