Hello, Hugging Face community.
I'm encountering an error when trying to fine-tune a language model from the Hugging Face Hub, and I suspect it is related to the format of my dataset. Could you help me identify the cause of the error and suggest a solution?
Code:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset
model_name = "elyza/ELYZA-japanese-Llama-2-13b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Load your CSV dataset
train_dataset = load_dataset('csv', data_files=['./documents.csv'])
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',  # Output directory
    num_train_epochs=3       # Adjust based on dataset size
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)
trainer.train()
CSV File (documents.csv):
| Document Type | Text Content |
|---|---|
| Type1 | ContentText11 |
| Type1 | ContentText12 |
| Type2 | ContentText21 |
Error Message:
Traceback (most recent call last):
File "c:\learning\learning.py", line 23, in <module>
trainer.train()
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\transformers\trainer.py", line 1780, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\transformers\trainer.py", line 2085, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\accelerate\data_loader.py", line 452, in __iter__
current_batch = next(dataloader_iter)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\torch\utils\data\dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
~~~~~~~~~~~~^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\datasets\dataset_dict.py", line 81, in __getitem__
raise KeyError(
KeyError: "Invalid key: 0. Please first select a split. For example: `my_dataset_dictionary['train'][0]`. Available splits: ['train']"
0%| | 0/3 [00:00<?, ?it/s]
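For context on what I think the error message is saying: `load_dataset(...)` apparently returns a `DatasetDict` keyed by split name (here just `'train'`), not a flat dataset, so when the `Trainer`'s dataloader indexes it with an integer (`dataset[0]`) the lookup fails. A toy stand-in using a plain dict (not the real `datasets` API, just an illustration of the indexing mismatch) shows the same failure mode:

```python
# Toy stand-in for a DatasetDict: a mapping keyed by split name.
# The Trainer's DataLoader indexes the dataset with integers, which
# fails on a split-keyed mapping until a split is selected first.
dataset_dict = {
    "train": [{"Document Type": "Type1", "Text Content": "ContentText11"}]
}

try:
    dataset_dict[0]           # what the DataLoader effectively attempts
except KeyError as e:
    print("KeyError:", e)     # prints "KeyError: 0", mirroring the traceback

row = dataset_dict["train"][0]  # selecting the split first succeeds
print(row["Text Content"])      # prints "ContentText11"
```

So I'm guessing the fix involves passing `train_dataset['train']` (and presumably tokenizing the text column) rather than the whole dict, but I'd appreciate confirmation of the right approach.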
Thank you!