Correct way to create a Dataset from a csv file

Hi, could somebody please point me to a beginner’s tutorial that would enable me to load a CSV file into a dataset for a fine-tuning task? I completed such a task as a learning experience using the “opus_books” dataset, and my DatasetDict takes the following form:

books
DatasetDict({
train: Dataset({
features: [‘id’, ‘translation’],
num_rows: 127085
})
})

However, I’m struggling to get it right with a csv file. With the command

luganda_dataset = load_dataset('csv', data_files='Luganda.csv', cache_dir='./')
I end up with a DatasetDict in the following form:

luganda_dataset
DatasetDict({
train: Dataset({
features: [‘en’, ‘lg’, ‘Unnamed: 2’, ‘Unnamed: 3’],
num_rows: 16000
})
})
This is clearly not right and I am at a loss as to what I need to do.
I would welcome some guidance on what I have to do to create the Dataset correctly.
Thank you.
Terence

Did you check your dataset’s features?

Maybe check first by reading it with a pandas dataframe, and if it works, do

> from datasets import Dataset 
> import pandas as pd 
> df = pd.DataFrame({"a": [1, 2, 3]}) 
> dataset = Dataset.from_pandas(df)

But this is just a workaround to debug.

I’m aware of the reason for ‘Unnamed: 2’ and ‘Unnamed: 3’: each row of the CSV file ended with “,”. However, I am still getting the column names “en” and “lg” as features when the features should be “id” and “translation”. I would really welcome some guidance on this, please.
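
A minimal sketch of one way to clean such a file before loading it, assuming the trailing commas are the only cause of the extra columns. The helper name `drop_unnamed_trailing_columns` and the sample rows are made up for illustration. It trims every row to the width of the named header columns, so a genuinely empty translation is kept as an empty field rather than dropped.

```python
import csv
import io


def drop_unnamed_trailing_columns(rows):
    """Trim every row to the number of named header columns."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:  # empty input: nothing to yield
        return
    # Remove the empty header cells created by trailing commas.
    while header and header[-1].strip() == "":
        header = header[:-1]
    yield header
    for row in rows:
        yield row[: len(header)]


# Hypothetical sample mimicking a file whose rows end with ",,"
raw = "en,lg,,\nhello,world,,\nonly source,,,\n"
cleaned = list(drop_unnamed_trailing_columns(csv.reader(io.StringIO(raw))))
print(cleaned)
```

The cleaned rows could then be written back out with `csv.writer` and the resulting file passed to `load_dataset` as before.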

Well, what is the first line of your csv file? What’s the output of

head -n 3 Luganda.csv

(venv) tel34@moses:~/nmtgateway/Helsinki-NLP/finetuning$ head -n 3 Luganda.csv
en,lg
All refugees were requested to register with the chairman.,Abanoonyiboobubudamu bonna baasabiddwa beewandiise ewa ssentebe.
They called for a refugees’ meeting yesterday.,Baayise olukungaana lw’abanoonyiboobubudamu eggulo.

So as you can see, the titles of the columns in your csv are indeed en and lg… I don’t understand your problem. The feature names are the column names.

Well, when I run the fine-tuning script, which works perfectly well for fine-tuning t5-small on the “opus_books” dataset, I get the following KeyError:
Traceback (most recent call last):
File “finetune_luganda.py”, line 24, in <module>
tokenized_luganda = luganda_dataset.map(preprocess_function, batched=True)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/dataset_dict.py”, line 512, in map
for k, dataset in self.items()
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/dataset_dict.py”, line 512, in <dictcomp>
for k, dataset in self.items()
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2120, in map
desc=desc,
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 518, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 485, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/fingerprint.py”, line 413, in wrapper
out = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2485, in _map_single
offset=offset,
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2367, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2062, in decorated
result = f(decorated_item, *args, **kwargs)
File “finetune_luganda.py”, line 16, in preprocess_function
inputs = [prefix + example[source_lang] for example in examples[“translation”]]
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 123, in __getitem__
values = super().__getitem__(key)
File “/usr/lib/python3.6/collections/__init__.py”, line 991, in __getitem__
raise KeyError(key)
KeyError: ‘translation’

Hi! These are the features of the opus_books dataset:

features=datasets.Features(
    {
        "id": datasets.Value("string"),
        "translation": datasets.Translation(languages=(self.config.lang1, self.config.lang2)),
    },
),

You can align your dataset with it as follows (after loading):

ds = ds.map(
    lambda ex, i: {"id": i, "translation": dict(ex)},
    remove_columns=["en", "lg"],
    features=datasets.Features({"id": datasets.Value("string"), "translation": datasets.Translation(languages=["en", "lg"])}),
    with_indices=True,
)
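
To make the target layout concrete, here is a small plain-Python sketch of the per-row reshaping that the lambda above performs. The function name `to_translation_record` and the sample rows are hypothetical; the id is converted to a string here only to match the `datasets.Value("string")` feature declared above.

```python
def to_translation_record(row, idx):
    """Turn a flat row like {"en": ..., "lg": ...} into the opus_books layout."""
    return {
        "id": str(idx),  # stored as a string, like the "id" feature above
        "translation": {"en": row["en"], "lg": row["lg"]},
    }


# Made-up rows standing in for what the CSV loader yields
rows = [
    {"en": "first source sentence", "lg": "first target sentence"},
    {"en": "second source sentence", "lg": "second target sentence"},
]
records = [to_translation_record(row, i) for i, row in enumerate(rows)]
print(records[0])
```

After this step each record exposes a "translation" key, which is what a preprocessing function written for “opus_books” expects to index.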

Thank you for taking the trouble to answer my query :-). I have incorporated your suggestion into my script, yet I am still getting KeyError(‘translation’). The script and traceback are below:
import datasets
from transformers import AutoTokenizer
from datasets import load_dataset
from transformers import DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
luganda_dataset = load_dataset("csv", data_files="Luganda.csv")
luganda_dataset.map(lambda ex, i: {"id": i, "translation": dict(ex)}, remove_columns=["en", "lg"], features=datasets.Features({"id": datasets.Value("string"), "translation": datasets.Translation(languages=["en", "lg"])}), with_indices=True,)
luganda_dataset = luganda_dataset["train"].train_test_split(test_size=0.2)
tokenizer = AutoTokenizer.from_pretrained("./opus-mt-en-lg")
source_lang = "en"
target_lang = "lg"
prefix = "translate English to Luganda: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_luganda = luganda_dataset.map(preprocess_function, batched=True)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_luganda["train"],
    eval_dataset=tokenized_luganda["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Traceback (most recent call last):
File “finetune_luganda.py”, line 28, in <module>
tokenized_luganda = luganda_dataset.map(preprocess_function, batched=True)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/dataset_dict.py”, line 512, in map
for k, dataset in self.items()
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/dataset_dict.py”, line 512, in <dictcomp>
for k, dataset in self.items()
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2120, in map
desc=desc,
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 518, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 485, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/fingerprint.py”, line 413, in wrapper
out = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2485, in _map_single
offset=offset,
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2367, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2062, in decorated
result = f(decorated_item, *args, **kwargs)
File “finetune_luganda.py”, line 20, in preprocess_function
inputs = [prefix + example[source_lang] for example in examples[“translation”]]
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 123, in __getitem__
values = super().__getitem__(key)
File “/usr/lib/python3.6/collections/__init__.py”, line 991, in __getitem__
raise KeyError(key)
KeyError: ‘translation’

You need to assign the actual value returned by map, so replace:

luganda_dataset.map(lambda ex, i: {"id": i, "translation": dict(ex)}, remove_columns=["en", "lg"], features=datasets.Features({"id": datasets.Value("string"), "translation": datasets.Translation(languages=["en", "lg"])}), with_indices=True,)

with

luganda_dataset = luganda_dataset.map(lambda ex, i: {"id": i, "translation": dict(ex)}, remove_columns=["en", "lg"], features=datasets.Features({"id": datasets.Value("string"), "translation": datasets.Translation(languages=["en", "lg"])}), with_indices=True,)

Yes, that was my silly error, sorry. The script now gets further but throws the following at line 16: TypeError: must be str, not NoneType
The only significant difference from my script for the “opus_books” dataset, which worked perfectly, is that here I am reading in a CSV file.

Traceback (most recent call last):
File “finetune_luganda.py”, line 25, in <module>
tokenized_luganda = luganda_dataset.map(preprocess_function, batched=True)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/dataset_dict.py”, line 512, in map
for k, dataset in self.items()
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/dataset_dict.py”, line 512, in <dictcomp>
for k, dataset in self.items()
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2120, in map
desc=desc,
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 518, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 485, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/fingerprint.py”, line 413, in wrapper
out = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2485, in _map_single
offset=offset,
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2367, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2062, in decorated
result = f(decorated_item, *args, **kwargs)
File “finetune_luganda.py”, line 16, in preprocess_function
inputs = [prefix + example[source_lang] for example in examples[“translation”]]
File “finetune_luganda.py”, line 16, in <listcomp>
inputs = [prefix + example[source_lang] for example in examples[“translation”]]

Okay, this means you have None values under the source_lang key, which can happen if your CSV file is missing English translations in some lines, i.e., you have lines like this:

,<lg_translation>
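
One way to locate such lines is to scan the file with the standard `csv` module before loading it. This is only a sketch: the function name `rows_with_missing_fields` is made up, the in-memory sample stands in for the real file, and it assumes the header names are "en" and "lg" as shown earlier in the thread.

```python
import csv
import io


def rows_with_missing_fields(csv_file, columns=("en", "lg")):
    """Return (line_number, [empty column names]) for every incomplete row."""
    reader = csv.DictReader(csv_file)
    missing = []
    for row in reader:
        # A field counts as missing if it is absent, empty, or whitespace only.
        empty = [c for c in columns if not (row.get(c) or "").strip()]
        if empty:
            # reader.line_num is the number of source lines read so far.
            missing.append((reader.line_num, empty))
    return missing


# Hypothetical sample: line 3 lacks the English side, line 4 the Luganda side.
sample = io.StringIO("en,lg\nhello,world\n,only target\nonly source,\n")
print(rows_with_missing_fields(sample))  # [(3, ['en']), (4, ['lg'])]
```

For the real file, the same function could be called on `open("Luganda.csv", newline="", encoding="utf-8")`.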

OK, that has been resolved. The script now moves on further to produce the following error:
“Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.”
I have no idea what input is invalid here. As mentioned at the outset the script worked with the “opus_books” dataset.

Traceback (most recent call last):
File “finetune_luganda.py”, line 25, in <module>
tokenized_luganda = luganda_dataset.map(preprocess_function, batched=True)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/dataset_dict.py”, line 512, in map
for k, dataset in self.items()
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/dataset_dict.py”, line 512, in <dictcomp>
for k, dataset in self.items()
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2120, in map
desc=desc,
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 518, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 485, in wrapper
out: Union[“Dataset”, “DatasetDict”] = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/fingerprint.py”, line 413, in wrapper
out = func(self, *args, **kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2485, in _map_single
offset=offset,
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2367, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File “/home/tel34/venv/lib/python3.6/site-packages/datasets/arrow_dataset.py”, line 2062, in decorated
result = f(decorated_item, *args, **kwargs)
File “finetune_luganda.py”, line 21, in preprocess_function
labels = tokenizer(targets, max_length=128, truncation=True)
File “/home/tel34/venv/lib/python3.6/site-packages/transformers/tokenization_utils_base.py”, line 2420, in __call__
**kwargs,
File “/home/tel34/venv/lib/python3.6/site-packages/transformers/tokenization_utils_base.py”, line 2605, in batch_encode_plus
**kwargs,
File “/home/tel34/venv/lib/python3.6/site-packages/transformers/tokenization_utils.py”, line 690, in _batch_encode_plus
first_ids = get_input_ids(ids)
File “/home/tel34/venv/lib/python3.6/site-packages/transformers/tokenization_utils.py”, line 671, in get_input_ids
“Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.”
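
This message appears when the tokenizer is handed something other than strings. One plausible cause here, in line with the earlier None values on the source side, is that some rows have no target-side translation. The sketch below shows a predicate that keeps only rows where both sides are non-empty strings. The name `has_both_sides` is hypothetical, and using it with `Dataset.filter` before tokenization is an assumption about how it would be wired in, not something confirmed in this thread.

```python
def has_both_sides(example, source_lang="en", target_lang="lg"):
    """True only if both translations are non-empty strings."""
    pair = example["translation"]
    return all(
        isinstance(pair.get(lang), str) and pair[lang].strip() != ""
        for lang in (source_lang, target_lang)
    )


# Made-up examples in the {"id", "translation"} layout
examples = [
    {"id": "0", "translation": {"en": "source text", "lg": "target text"}},
    {"id": "1", "translation": {"en": "source text", "lg": None}},
    {"id": "2", "translation": {"en": "   ", "lg": "target text"}},
]
kept = [ex["id"] for ex in examples if has_both_sides(ex)]
print(kept)  # ['0']

# With the datasets library, the predicate could be applied as
# luganda_dataset = luganda_dataset.filter(has_both_sides)
```

Dropping incomplete rows this way avoids passing None to the tokenizer, at the cost of a slightly smaller training set.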

The script now works with a modification by Yasmin Moslem. I post it below in case others have a similar issue:

import datasets
from transformers import AutoTokenizer
from datasets import load_dataset
from transformers import DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from random import randrange

luganda_dataset = load_dataset("csv", data_files="Luganda.csv")
luganda_dataset = luganda_dataset["train"].map(lambda ex, i: {"id": i, "translation": dict(ex)}, remove_columns=["en", "lg"], features=datasets.Features({"id": datasets.Value("string"), "translation": datasets.Translation(languages=["en", "lg"])}), with_indices=True,)
luganda_dataset = luganda_dataset.train_test_split(test_size=0.2)
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-lg")
source_lang = "en"
target_lang = "lg"
prefix = "translate English to Luganda: "

def preprocess_function(examples):
    inputs = []
    targets = []
    for example in examples["translation"]:
        if example[source_lang] is not None and example[target_lang] is not None and \
        len(example[source_lang].strip()) > 3 and len(example[target_lang].strip()) > 3:
            inputs.append(prefix + example[source_lang].strip())
            targets.append(example[target_lang].strip())
        else:
            print("There is an issue with this segment:")
            print("Source:", example[source_lang])
            print("Target:", example[target_lang])
            random_num = randrange(10000)
            print("Replaced with", random_num)
            inputs.append(prefix + str(random_num))
            targets.append(str(random_num))
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
    return model_inputs

print(luganda_dataset)
tokenized_luganda = luganda_dataset.map(preprocess_function, batched=True)
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-lg")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_luganda["train"],
    eval_dataset=tokenized_luganda["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()