Hello, Hugging Face community.
I'm encountering an error when trying to fine-tune a language model from the Hugging Face Hub, and I suspect it is related to the format of my dataset. Could you help me identify the cause of the error and suggest a solution?
Code:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset
model_name = "elyza/ELYZA-japanese-Llama-2-13b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Load your CSV dataset
train_dataset = load_dataset('csv', data_files=['./documents.csv'])
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',  # Output directory
    num_train_epochs=3       # Adjust based on dataset size
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)
trainer.train()
CSV File (documents.csv):
| Document Type | Text Content |
|---|---|
| Type1 | ContentText11 |
| Type1 | ContentText12 |
| Type2 | ContentText21 |
Error Message:
Traceback (most recent call last):
File "c:\learning\learning.py", line 23, in <module>
trainer.train()
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\transformers\trainer.py", line 1780, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\transformers\trainer.py", line 2085, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\accelerate\data_loader.py", line 452, in __iter__
current_batch = next(dataloader_iter)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\torch\utils\data\dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
~~~~~~~~~~~~^^^^^
File "C:\Users\airep\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\datasets\dataset_dict.py", line 81, in __getitem__
raise KeyError(
KeyError: "Invalid key: 0. Please first select a split. For example: `my_dataset_dictionary['train'][0]`. Available splits: ['train']"
0%| | 0/3 [00:00<?, ?it/s]
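For context on what I think the error message is saying: `load_dataset(...)` apparently returns a `DatasetDict` keyed by split name (here just `'train'`), not a flat dataset, so when the `Trainer`'s dataloader indexes it with an integer (`dataset[0]`) the lookup fails. A toy stand-in using a plain dict (not the real `datasets` API, just an illustration of the indexing mismatch) shows the same failure mode:

```python
# Toy stand-in for a DatasetDict: a mapping keyed by split name.
# The Trainer's DataLoader indexes the dataset with integers, which
# fails on a split-keyed mapping until a split is selected first.
dataset_dict = {
    "train": [{"Document Type": "Type1", "Text Content": "ContentText11"}]
}

try:
    dataset_dict[0]           # what the DataLoader effectively attempts
except KeyError as e:
    print("KeyError:", e)     # prints "KeyError: 0", mirroring the traceback

row = dataset_dict["train"][0]  # selecting the split first succeeds
print(row["Text Content"])      # prints "ContentText11"
```

So I'm guessing the fix involves passing `train_dataset['train']` (and presumably tokenizing the text column) rather than the whole dict, but I'd appreciate confirmation of the right approach.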
Thank you!