SFTTrain and datasets, my head hurts

SuperbEmphasis · February 11, 2025, 7:14pm

I am hoping someone can point out something obvious that I am doing wrong, because I have no clue what I am doing.

I have made a Parquet dataset. I generated a full prompt, and then I split that prompt into two columns, prompt and completion, each of which is an array with one item per example.

To make the dataframe, I am using pandas. I built two arrays, prompt and completion, and put them into one dict, each under its own key.

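import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# prompt and completion are equal-length lists built earlier
df = pd.DataFrame({'prompt': prompt,
                   'completion': completion})

# Convert to an Arrow table and write it out as Parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, './save/sft.parquet')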

Then I am using:

To try to just get it to run…

Then I load the dataset. I printed the column names as I am trying to figure out what the heck the errors mean

dataset = load_dataset("parquet", data_files="./save/sft.parquet")
print(dataset.column_names)  # This will now work correctly.

...
Mostly using the rest of the sft_trainer example

I get the column names back from my print statement (No clue if this is good or bad)

{'train': ['prompt', 'completion']}

And lastly, the error:

ValueError: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['prompt', 'completion']

And I am not sure what the heck I am supposed to do! Should the array just be called “train”, and then am I supposed to have a key named “text” which should hold the JSON-appropriate prompt/completion?

Thanks!

John6666 · February 12, 2025, 3:09am

If it’s a bug, it’s probably this.

github.com/unslothai/unsloth

SFTTrainer doesn't work with some datasets due to column key error

opened 08:17PM - 16 Mar 24 UTC

JohnnyRacer

Hello, I've been trying to use the `SFTTrainer` with the [vicgalle/alpaca-gpt4]…(https://huggingface.co/datasets/vicgalle/alpaca-gpt4) dataset. However after prepping the dataset in the SFT format, I keep on getting this error when I initialize the trainer.

```python
File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:3025, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3023 missing_columns = set(remove_columns) - set(self._data.column_names)
   3024 if missing_columns:
-> 3025     raise ValueError(
   3026         f"Column to remove {list(missing_columns)} not in the dataset. Current columns in the dataset: {self._data.column_names}"
   3027     )
   3029 load_from_cache_file = load_from_cache_file if load_from_cache_file is not None else is_caching_enabled()
   3031 if fn_kwargs is None:

ValueError: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['instruction', 'input', 'output', 'text']
```

However, the the dataset only has the `train` split when I print it. This only occurs with some datasets, I suspect this maybe a bug.

Anyway, I personally think it’s more likely that it’s not a bug but a confusion.

To put it simply, for Hugging Face’s Trainer to work properly, you need to prepare a dataset in the specific format that the Hugging Face datasets library expects, but it’s not easy to get there…
It’s enough to give you a headache.
I think many people convert from pandas. There are various from_* functions, and most of the formats that Python handles are covered.

SuperbEmphasis · February 18, 2025, 5:42pm

Thanks for this! I believe this solved the issue. I just went straight to a json format with the prompt and answer. This seemed to work.
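
For anyone landing here later, a minimal sketch of what a JSON route could look like. The exact file layout used above is not shown in this thread, so the JSON Lines format, the prompt/completion field names, the helper names write_jsonl and load_train_split, and the sample row are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, prompts, completions):
    """Write one {"prompt": ..., "completion": ...} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for p, c in zip(prompts, completions):
            record = {"prompt": p, "completion": c}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_train_split(path):
    """Load the file as a single Dataset (hypothetical helper)."""
    # Imported lazily so the writer above works without the datasets library.
    from datasets import load_dataset

    # split="train" returns a Dataset rather than a DatasetDict.
    return load_dataset("json", data_files=str(path), split="train")


# Made-up example row written to a temporary location.
out_path = Path(tempfile.gettempdir()) / "sft_example.jsonl"
write_jsonl(out_path, ["What is 2 + 2?"], ["4"])
print(out_path.read_text(encoding="utf-8"))
```

Calling load_train_split(out_path) should then give a Dataset with prompt and completion columns that can be passed to the trainer directly.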
