@A-Alsad For the Falcon fine-tune code, I traced the problem back to the _prepare_dataset() function in the trl SFTTrainer (trl/trainer/sft_trainer.py). It changes the dataset column names from "text" to "input_ids" for the openassistant dataset. With my custom dataset it didn't do this, so my "text" column got removed (as described above); the dataset was therefore empty, which produced the error. I have to investigate further, but maybe this helps.
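You can see the renaming by comparing column names before and after the trainer's preprocessing. A minimal sketch, assuming a datasets-style Dataset; the dataset contents are placeholders:

```python
from datasets import Dataset

# Placeholder dataset standing in for my custom data (illustrative only).
dataset = Dataset.from_dict({"text": ["some training example", "another example"]})
print(dataset.column_names)  # ['text']

# After constructing the SFTTrainer, the prepared dataset should expose
# 'input_ids' instead of 'text'. If preprocessing produced no samples,
# the original column is still removed and the dataset ends up empty.
# trainer = SFTTrainer(model=model, train_dataset=dataset, dataset_text_field="text", ...)
# print(trainer.train_dataset.column_names)  # expected: ['input_ids']
```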
Edit1: I figured out the problem with my dataset. For me it had to do with the tokenizing in the _prepare_non_packed_dataloader() function in the SFTTrainer. If max_seq_len is too high (I guess relative to the specific dataset), it can produce an empty input_batch. It warns about it but continues anyway and removes the original column. That's basically the root of the error described above. Lowering max_seq_len did the trick for me (sketch below). I don't know enough about the tokenization process, so maybe someone else could explain.
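For reference, here is a minimal sketch of the workaround, assuming the SFTTrainer API from the trl version current when this was written; the model name, dataset contents, and the exact max_seq_length value are placeholders (pick a value that fits your data):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model_name = "tiiuae/falcon-7b"  # assumption: the Falcon model discussed in this thread
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Placeholder dataset with a "text" column, standing in for my custom data.
dataset = Dataset.from_dict({"text": ["example one", "example two"]})

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,  # lowering this value made the empty input_batch go away for me
    tokenizer=tokenizer,
)
trainer.train()
```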
Edit2: The problem seems to be that the trl SFT trainer filters out samples that are shorter than max_seq_len after tokenization. See the github issue posted below. So my fix above does not really solve the issue: lowering max_seq_len just means more samples reach that length and survive the filter, while genuinely short samples are still dropped.
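For anyone curious, this is roughly the filtering logic as I understand it (a paraphrase of the behavior described above, not the exact trl source; the tokenizer and max_seq_len values are placeholders):

```python
import warnings
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")  # placeholder tokenizer
max_seq_len = 1024  # placeholder value

def tokenize(element):
    # Tokenize with truncation/overflow so long texts are split into chunks
    # of at most max_seq_len tokens.
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=max_seq_len,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        # Only chunks of exactly max_seq_len survive; samples that tokenize
        # to fewer tokens are filtered out here.
        if length == max_seq_len:
            input_batch.append(input_ids)
    if len(input_batch) == 0:
        # It warns here but continues anyway, which is why the "text" column
        # gets removed and the dataset ends up empty.
        warnings.warn("Found no samples of length max_seq_len in this batch.")
    return {"input_ids": input_batch}
```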