I am hoping someone can point out something obvious that I am doing wrong, because I have no clue what I am doing.
I have made a Parquet dataset. I generated a full prompt and then split it into two columns, prompt and completion, each of which holds an array of items.
To make the dataframe, I am using pandas. I built two arrays, prompt and completion, and put them in the same dict under those keys.
Then I load the dataset. I printed the column names because I am trying to figure out what the heck the errors mean:
from datasets import load_dataset
dataset = load_dataset("parquet", data_files="./save/sft.parquet")
print(dataset.column_names)  # prints the column names of each split
...
The rest mostly follows the sft_trainer example.
I get the column names back from my print statement (no clue if this is good or bad):
{'train': ['prompt', 'completion']}
And lastly, the error:
ValueError: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['prompt', 'completion']
And I am not sure what the heck I am supposed to do! Should the array just be called “train”, and am I then supposed to have a key named “text” that holds the prompt/completion as appropriately formatted JSON?
Anyway, I personally think this is more likely a misunderstanding than a bug.
To put it simply, to get the Hugging Face Trainer to work properly, you need to prepare a dataset in the specific format that the Hugging Face datasets library expects, but it’s not easy to get there…
It’s enough to give you a headache.
I think many people build the dataset directly from a pandas DataFrame. The datasets library provides various from_* functions (Dataset.from_pandas, for example), covering most of the formats commonly handled in Python.