Hi,
I'm building a Hugging Face `Dataset` from scratch from a TSV file. My initial approach was to initialize an empty `Dataset`, then iterate over the TSV file, create a dictionary for each line, and append it to the `Dataset` with the `add_item()` method. However, this has turned out to be quite slow. Here is a snippet of my code:
```python
from datasets import ClassLabel, Dataset, Features, Value, Sequence
import pyarrow as pa

# Initialize an empty Dataset with the expected columns
dataset = Dataset(
    pa.table({
        "tokens": [],
        "ner_tags": []
    })
)

# Some more code here to parse the TSV file into lines

for line in lines:
    # Some more code here to create a dataset entry
    dataset_entry = {
        "tokens": tokens,
        "ner_tags": tags
    }
    # dataset.add_item(dataset_entry)  # Fast, but the result is discarded:
    # add_item() returns a new Dataset, so it must be reassigned to take effect
    dataset = dataset.add_item(dataset_entry)  # This is very slow
```
Alternatively, I later modified the loop above to build a Python list of `dataset_entry` items and then used `Dataset.from_list()` to create the dataset from that list. That was much faster than the approach above.
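For reference, here is a minimal sketch of that faster variant; the TSV parsing is elided as in the snippet above, and `lines`, `tokens`, and `tags` are assumed to come from the same parsing code:

```python
from datasets import Dataset

entries = []
for line in lines:
    # Same parsing as above to produce `tokens` and `tags` (assumed)
    entries.append({
        "tokens": tokens,
        "ner_tags": tags
    })

# Build the Dataset once from the accumulated list of dicts
dataset = Dataset.from_list(entries)
```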
Why is `dataset = dataset.add_item(dataset_entry)` so slow? Is there a way of making it faster, comparable to the `.from_list()` method?
Thanks