Specifying a Sequence feature slows down the generation of a dataset

Hello,

I want to generate a Hugging Face dataset from a generator function. Part of the data I want to store consists of long arrays (on the order of 500,000 elements).

Here is a minimal example:

from datasets import Dataset
import numpy as np
np.random.seed(1000)
def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

# Let datasets infer the feature type automatically
ds = Dataset.from_generator(generator=generator, writer_batch_size=100)

This finishes in about one second. But when I explicitly specify the feature of the “data” column, as here:

from datasets import Dataset, Value, Features, Sequence
import numpy as np
np.random.seed(1000)
def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

features = {"data": Sequence(Value("float32"))}
ds2 = Dataset.from_generator(generator=generator, writer_batch_size=100, features=Features(features))

it takes about 20 seconds, even though the automatically inferred feature in the first case is identical to the explicitly specified one in the second case.
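
For reference, a minimal timing harness along these lines reproduces the comparison (disable_caching keeps repeated runs from being served out of the cache; the exact numbers will vary by machine):

import time
import numpy as np
from datasets import Dataset, Value, Features, Sequence, disable_caching

disable_caching()  # avoid reusing a cached dataset between runs
np.random.seed(1000)

def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

start = time.perf_counter()
Dataset.from_generator(generator=generator, writer_batch_size=100)
print(f"inferred features:  {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
features = Features({"data": Sequence(Value("float32"))})
Dataset.from_generator(generator=generator, writer_batch_size=100, features=features)
print(f"specified features: {time.perf_counter() - start:.1f}s")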

Why is it slower in this case to specify the feature? And is there a way to get the speed of the first case when specifying the feature?

Thank you!

MacBook Pro M2
macOS 13.3
datasets 2.14.4

Hi!

In the first case, where you don’t specify the features, the datasets library lets the schema be inferred and can write the arrays more or less directly. In the second case, where you explicitly specify the feature structure, each example is first encoded and validated against the specified features, and this extra per-example processing slows down dataset creation.

To get the speed of the first case while specifying the feature, you can declare the column type with Value directly, without wrapping it in a Sequence. This way you specify the data type without the extra sequence level. Here’s how:

from datasets import Dataset, Value, Features
import numpy as np

np.random.seed(1000)

def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

# Specify the feature directly without Sequence
features = {"data": Value("float32")}
ds2 = Dataset.from_generator(generator=generator, writer_batch_size=100, features=Features(features))

By specifying the feature this way, you should see improved speed similar to the first case, as you’re avoiding the additional overhead introduced by the unnecessary Sequence wrapping.

When I run this I get the following error:

TypeError: only size-1 arrays can be converted to Python scalars

If I understand the features correctly, with features = {"data": Value("float32")} the dataset expects “data” to be a single float32 number instead of an array of float32 numbers.
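
As a quick sanity check (unrelated to my actual data), a scalar per example does work with that feature definition:

from datasets import Dataset, Value, Features
import numpy as np

np.random.seed(1000)

def scalar_generator():
    for _ in range(300):
        # one float32 scalar per example matches Value("float32")
        yield {"data": float(np.random.rand())}

features = Features({"data": Value("float32")})
ds = Dataset.from_generator(generator=scalar_generator, writer_batch_size=100, features=features)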

Yeah, that could be an issue.

The error suggests that with Value("float32") the library attempts to cast the entire NumPy array to a single scalar, which fails.

Let’s reconsider:

  1. We keep the Sequence since we are dealing with arrays.
  2. There may be an underlying inefficiency or bottleneck when features are specified, likely in how the library checks and validates the data against the provided features, which would explain the slow performance.

Let’s try the Sequence feature type with an explicit length:

from datasets import Dataset, Value, Features, Sequence
import numpy as np

np.random.seed(1000)

def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

# Specify the sequence length in the feature definition
features = {"data": Sequence(feature=Value("float32"), length=100000)}
ds2 = Dataset.from_generator(generator=generator, writer_batch_size=100, features=Features(features))

If this does not improve the performance significantly, the slowdown likely comes from an inefficiency in the datasets library itself when features are specified explicitly.

Hope this helps!

Thanks for your help, but unfortunately specifying the length does not solve the problem. The performance stays about the same.

Is there a reason why the ArrayXD feature is only defined for dimensions >= 2? In a fork I implemented an Array1D (mostly copied from Array2D), and when I use it my problem is solved. But I am not sure whether this would cause problems further down the line after creating the dataset. I could of course just use an Array2D and reshape my 1D array to a 2D array of shape (-1, 1), as sketched below, but this is a bit inelegant and also adds overhead, since I always have to convert the data back when using it. So I am wondering why there is no native Array1D implementation.
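
Here is roughly what that reshape workaround would look like (a sketch; Array2D needs a fixed 2D shape, so the array is written as (N, 1) and has to be flattened again on every read):

from datasets import Dataset, Features, Array2D
import numpy as np

np.random.seed(1000)

def generator():
    for _ in range(300):
        # Reshape the 1D array to (N, 1) so it matches the 2D feature
        yield {"data": np.random.rand(100000).astype(np.float32).reshape(-1, 1)}

features = Features({"data": Array2D(shape=(100000, 1), dtype="float32")})
ds = Dataset.from_generator(generator=generator, writer_batch_size=100, features=features)

# Every read needs the inverse transformation back to 1D
arr = np.asarray(ds[0]["data"]).reshape(-1)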

Here is a link to the code on GitHub:

The Array2D workaround with shape (-1, 1) is also not an option, because the cast_to_python_object function then becomes very slow: a list comprehension is run over every dimension except the last. In my case the array has shape (500000, 1), so this takes very long.

Again, I think this could be solved by a real Array1D implementation. Or is there another way to handle long 1D arrays efficiently?

If not, I could also prepare a PR which adds the Array1D feature.


Indeed, the handling of NumPy arrays could be better optimized. In particular, these lines appear to be the problem when encoding arrays to a Sequence, since each array item is encoded separately:
https://github.com/huggingface/datasets/blob/f2b028fd83d74e7701e7b8f2d87e740a989505a7/src/datasets/features/features.py#L1275-L1279
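
In other words, the slow path boils down to a Python-level loop of roughly this shape (a paraphrase for illustration with hypothetical helper names, not the exact library source):

def encode_item(feature, item):
    # stand-in for the per-item validation/casting the library performs
    return item

def encode_sequence(sub_feature, array):
    # one Python-level call per element: ~100,000 calls per example here
    return [encode_item(sub_feature, item) for item in array]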

We plan to introduce a Tensor type (backed by the recently introduced PyArrow Tensor type) soon to make handling of 1D arrays efficient; storage-wise it’s as efficient as Sequence(Value(<dtype>), length=np.prod(<shape>)). In the meantime, generate the dataset with ds = Dataset.from_generator(generator=generator, writer_batch_size=100) and then cast it with ds = ds.cast_column("data", Sequence(Value("float32"), length=100000)) to avoid the slow encode_nested_example code path.
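
Put together, the suggested workaround looks like this:

from datasets import Dataset, Value, Features, Sequence
import numpy as np

np.random.seed(1000)

def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

# Generate with inferred features first (fast path) ...
ds = Dataset.from_generator(generator=generator, writer_batch_size=100)
# ... then cast the column to the desired feature type afterwards
ds = ds.cast_column("data", Sequence(Value("float32"), length=100000))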


Thank you for your reply. Is there already an approximate schedule for when this Tensor type will be implemented?
