Adding items to Dataset is slow compared to loading from Python list

Hi,

I’m building a Hugging Face Dataset from scratch from a TSV file. My initial approach was to initialize an empty Dataset, iterate over the TSV file to build a dictionary for each row, and then append each entry to the Dataset with the add_item() method. However, this turned out to be quite slow. Here is a snippet of my code:

from datasets import ClassLabel, Dataset, Features, Value, Sequence
import pyarrow as pa

# Initialize Dataset
dataset = Dataset(
    pa.table({
        "tokens": [],
        "ner_tags": []
    })
)

# Some more code here to parse the TSV file into lines

for line in lines:
    # Some more code here to create a dataset entry
    dataset_entry = {
        "tokens": tokens,
        "ner_tags": tags
    }

    # Calling dataset.add_item(dataset_entry) without reassigning is fast,
    # but add_item() returns a new Dataset instead of modifying in place,
    # so the result has to be assigned back to actually update `dataset`:
    dataset = dataset.add_item(dataset_entry)  # This is very slow

Alternatively, I later modified the loop above to collect the dataset_entry items in a Python list and then used Dataset.from_list() to create the dataset from that list. That was much faster than the approach above.
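For reference, the faster version looks roughly like this (parse_line() is just a stand-in for my actual TSV parsing code):

from datasets import Dataset

entries = []
for line in lines:
    tokens, tags = parse_line(line)  # stand-in for the parsing code above
    entries.append({"tokens": tokens, "ner_tags": tags})

# Build the whole dataset in one go instead of row by row
dataset = Dataset.from_list(entries)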

Why is dataset = dataset.add_item(dataset_entry) so slow? Is there a way to make it faster, comparable to the .from_list() approach?

Thanks

Hi! Dataset objects store their data in Arrow format, which is a columnar format. This means it’s fast to add batches of data but slow to add rows one at a time: every add_item() call builds a new one-row Arrow table and a new Dataset object around it. Creating a dataset from a list is also fast (you can see it as one big batch).
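If you do want to build incrementally, one middle-ground sketch (BATCH_SIZE and parse_line() are placeholders here, not part of the library) is to buffer rows in plain Python and only create an Arrow table once per batch:

from datasets import Dataset, concatenate_datasets

BATCH_SIZE = 10_000  # tune to your memory budget
buffer, shards = [], []

for line in lines:
    tokens, tags = parse_line(line)  # same stand-in parser as above
    buffer.append({"tokens": tokens, "ner_tags": tags})
    if len(buffer) == BATCH_SIZE:
        # One Arrow conversion per batch instead of one per row
        shards.append(Dataset.from_list(buffer))
        buffer = []

if buffer:
    shards.append(Dataset.from_list(buffer))

# Concatenating Arrow-backed datasets is cheap compared to row-wise appends
dataset = concatenate_datasets(shards)

Dataset.from_generator() is another option if the parsed data doesn’t fit in memory.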