Specifying a Sequence feature slows down the generation of a dataset

Hello,

I want to generate a Hugging Face dataset from a generator function. Part of the data I want to store consists of long arrays (on the order of 500,000 elements).

Here is a minimal example:

from datasets import Dataset
import numpy as np
np.random.seed(1000)
def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

ds = Dataset.from_generator(generator=generator, writer_batch_size=100)

This finishes in about one second. But when I specify the feature of the “data” column, like here

from datasets import Dataset, Value, Features, Sequence
import numpy as np
np.random.seed(1000)
def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

features = {"data": Sequence(Value("float32"))}
ds2 = Dataset.from_generator(generator=generator, writer_batch_size=100, features=Features(features))

it takes 20 seconds, even though the automatically constructed feature in the first case is the same as the specified one in the second case.
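
For anyone who wants to reproduce the two timings side by side, here is a rough sketch of a timing harness. The helper names (`make_generator`, `timed`, `build_dataset`, `main`) are made up for illustration and are not part of `datasets`. It assumes that `Dataset.from_generator` accepts `cache_dir` and that a fresh cache directory per run keeps a previously cached result from hiding the real build time. The numbers will differ by machine and `datasets` version.

```python
import tempfile
import time

import numpy as np


def make_generator(n_rows=300, n_elems=100_000, seed=1000):
    """Return a generator function that yields float32 rows like the example above."""
    def generator():
        rng = np.random.default_rng(seed)
        for _ in range(n_rows):
            yield {"data": rng.random(n_elems, dtype=np.float32)}
    return generator


def timed(build):
    """Call build() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = build()
    return result, time.perf_counter() - start


def build_dataset(features=None):
    """Build the dataset in a throwaway cache dir; return (features, num_rows)."""
    # Imported lazily so the helpers above can be used without `datasets`.
    from datasets import Dataset

    # Assumption: a fresh cache_dir forces a real build instead of a cache hit.
    with tempfile.TemporaryDirectory() as cache_dir:
        ds = Dataset.from_generator(
            generator=make_generator(),
            writer_batch_size=100,
            features=features,
            cache_dir=cache_dir,
        )
        # Copy out what we need before the backing files are removed.
        summary = (ds.features, len(ds))
        del ds
    return summary


def main():
    from datasets import Features, Sequence, Value

    explicit = Features({"data": Sequence(Value("float32"))})

    (inferred_features, _), t_inferred = timed(build_dataset)
    (explicit_features, _), t_explicit = timed(lambda: build_dataset(explicit))

    print(f"inferred features: {t_inferred:.1f} s")
    print(f"explicit features: {t_explicit:.1f} s")
    # Check the claim that both runs end up with the same schema.
    print("same features:", inferred_features == explicit_features)
```

Calling `main()` prints both timings and whether the inferred and explicit features compare equal, which makes it easier to tell whether the slowdown comes from the feature specification itself rather than from a different resulting schema.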

Why is it slower to specify the feature in this case? And is there a way to get the speed of the first case when specifying the feature?

Thank you!

MacBook Pro M2
macOS 13.3
datasets 2.14.4
