I am hoping someone can point out something obvious that I am doing wrong, because I have no clue what I am doing.
I have made a Parquet dataset. I generated a full prompt, and then I split that prompt into two columns, prompt and completion, each holding an array of the items.
To make the dataframe, I am using pandas. I built two arrays, prompt and completion, and put them in the same dict, each under its own key.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'prompt': prompt,
                   'completion': completion})
table = pa.Table.from_pandas(df)
pq.write_table(table, './save/sft.parquet')
Then I am using:
To try to just get it to run…
Then I load the dataset. I printed the column names, as I am trying to figure out what the heck the errors mean.
dataset = load_dataset("parquet", data_files="./save/sft.parquet")
print(dataset.column_names) # This will now work correctly.
...
After that I am mostly using the rest of the sft_trainer example.
I get the column names back from my print statement (no clue if this is good or bad):
{'train': ['prompt', 'completion']}
And lastly, the error:
ValueError: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['prompt', 'completion']
And I am not sure what the heck I am supposed to do! Should the array just be called “train”, and then am I supposed to have a key named “text” which should have the JSON appropriate prompt/completion?
Thanks!
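A possible reading of the error, sketched in plain Python with no Hugging Face code involved. The assumption is that the printed column names are a dict keyed by split, and that something downstream iterates over that dict as if it were a list of columns. The helper columns_not_found is hypothetical and only stands in for whatever check raises the ValueError.

```python
# Plain-Python sketch; no Hugging Face libraries are used here.
# Assumption: column_names on the loaded object is a dict keyed by split,
# exactly as printed by the print statement above.
column_names = {"train": ["prompt", "completion"]}


# Hypothetical stand-in for a "are these columns present?" check.
def columns_not_found(to_remove, current_columns):
    """Return the requested names that are not among current_columns."""
    return [name for name in to_remove if name not in current_columns]


# Iterating over a dict yields its keys, so the split name "train"
# ends up being treated as if it were a column name.
missing = columns_not_found(column_names, column_names["train"])
print(missing)  # ['train']

# Using the per-split list instead leaves nothing missing.
print(columns_not_found(column_names["train"], column_names["train"]))  # []
```

If that reading is right, it may be worth trying load_dataset("parquet", data_files="./save/sft.parquet", split="train"), or passing dataset["train"] to the trainer, so that it receives a single split instead of the dict of splits. This has not been tested against the setup above.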