Adding items to Dataset is slow compared to loading from Python list

Hi,

I’m building a Hugging Face Dataset from scratch from a TSV file. My initial approach was to initialize an empty Dataset, iterate over the TSV file to build a dictionary for each row, and then append each entry to the Dataset with the add_item() method. However, this turned out to be quite slow. Here is a snippet of my code:

from datasets import ClassLabel, Dataset, Features, Value, Sequence
import pyarrow as pa

# Initialize Dataset
dataset = Dataset(
    pa.table({
        "tokens": [],
        "ner_tags": []
    })
)

# Some more code here to parse the TSV file into lines

for line in lines:
    # Some more code here to create a dataset entry
    dataset_entry = {
        "tokens": tokens,
        "ner_tags": tags
    }

    # Calling dataset.add_item(dataset_entry) without reassigning is fast,
    # but add_item() returns a new Dataset instead of modifying in place,
    # so the result has to be assigned back to actually update `dataset`:
    dataset = dataset.add_item(dataset_entry)  # This is very slow

Alternatively, I later modified the loop above to collect the dataset_entry items in a Python list and then used Dataset.from_list() to create the dataset from that list. That was much faster than the approach above.
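For reference, the faster version looks roughly like this (parse_line() is just a stand-in for my actual TSV parsing code):

from datasets import Dataset

entries = []
for line in lines:
    tokens, tags = parse_line(line)  # stand-in for the parsing code above
    entries.append({"tokens": tokens, "ner_tags": tags})

# Build the whole dataset in one go instead of row by row
dataset = Dataset.from_list(entries)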

Why is dataset = dataset.add_item(dataset_entry) so slow? Is there a way to make it faster, comparable to the .from_list() approach?

Thanks

Hi! Dataset objects store their data in Arrow format, which is a columnar format. This means it’s fast to add batches of data but slow to add rows one at a time: every add_item() call builds a new one-row Arrow table and a new Dataset object around it. Creating a dataset from a list is also fast (you can see it as one big batch).
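If you do want to build incrementally, one middle-ground sketch (BATCH_SIZE and parse_line() are placeholders here, not part of the library) is to buffer rows in plain Python and only create an Arrow table once per batch:

from datasets import Dataset, concatenate_datasets

BATCH_SIZE = 10_000  # tune to your memory budget
buffer, shards = [], []

for line in lines:
    tokens, tags = parse_line(line)  # same stand-in parser as above
    buffer.append({"tokens": tokens, "ner_tags": tags})
    if len(buffer) == BATCH_SIZE:
        # One Arrow conversion per batch instead of one per row
        shards.append(Dataset.from_list(buffer))
        buffer = []

if buffer:
    shards.append(Dataset.from_list(buffer))

# Concatenating Arrow-backed datasets is cheap compared to row-wise appends
dataset = concatenate_datasets(shards)

Dataset.from_generator() is another option if the parsed data doesn’t fit in memory.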