SFTTrain and datasets, my head hurts

I am hoping someone can point out something I am doing that is obvious, because I have no clue what I am doing.

I have made a Parquet dataset. I generated a full prompt, then split it into two columns, prompt and completion, each of which is an array of items.

To make the DataFrame, I am using pandas. I built two arrays, prompt and completion, and used each as a column:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'prompt': prompt,
                   'completion': completion})

table = pa.Table.from_pandas(df)
pq.write_table(table, './save/sft.parquet')

Then I am using:

To try to just get it to run…

Then I load the dataset. I printed the column names, as I am trying to figure out what the errors mean:

dataset = load_dataset("parquet", data_files="./save/sft.parquet")
print(dataset.column_names)  # This will now work correctly.

...
Mostly using the rest of the sft_trainer example

I get the column names back from my print statement (no clue if this is good or bad):

{'train': ['prompt', 'completion']}

And lastly, the error:

ValueError: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['prompt', 'completion']

And I am not sure what the heck I am supposed to do! Should the array just be called “train”, and am I then supposed to have a key named “text” containing the JSON-appropriate prompt/completion?

Thanks!


If it’s a bug, it’s probably this.

Anyway, I personally think it’s more likely that it’s not a bug but a point of confusion.

To put it simply, to get the Hugging Face Trainer to work properly, you need to prepare a dataset in a specific format that conforms to the Hugging Face datasets library, but it’s not easy to get there…
It’s enough to give you a headache.
I think many people use the following method from pandas. There are various from_* functions, and ones are provided for most of the formats Python handles.

Thanks for this! I believe this solved the issue. I just went straight to a JSON format with the prompt and answer. This seemed to work.
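In case it helps anyone else, a minimal sketch of that JSON route, assuming TRL’s prompt-completion format (one JSON object per line with "prompt" and "completion" keys; the file name and rows here are made-up examples):

```python
# Sketch: write the data as JSON Lines, one {"prompt", "completion"}
# object per line, using only the standard library.
import json

rows = [
    {"prompt": "What is 2 + 2?", "completion": "4"},
    {"prompt": "Name a primary color.", "completion": "Red"},
]

with open("sft.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back to check the round trip.
with open("sft.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == rows)  # True
```

A file like this can then be loaded with load_dataset("json", data_files="sft.jsonl"), and again the trainer should be handed the "train" split of the result.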
