Specifying a Sequence feature slows down the generation of a dataset

bbiltzing · September 6, 2023, 8:05am

Hello,

I want to generate a Hugging Face dataset from a generator function. Part of the data I want to store consists of long arrays (on the order of 500,000 elements).

Here is a simple example:

from datasets import Dataset
import numpy as np
np.random.seed(1000)
def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

ds = Dataset.from_generator(generator=generator, writer_batch_size=100)

This finishes in about one second. But when I explicitly specify the feature type of the “data” column, as here:

from datasets import Dataset, Value, Features, Sequence
import numpy as np
np.random.seed(1000)
def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

features = {"data": Sequence(Value("float32"))}
ds2 = Dataset.from_generator(generator=generator, writer_batch_size=100, features=Features(features))

it takes 20 seconds, even though the automatically constructed feature in the first case is the same as the specified one in the second case.

Why is it slower to specify the feature in this case? And is there a way to get the speed of the first case when specifying the feature?

Thank you!

MacBook Pro M2
macOS 13.3
datasets 2.14.4
