I want to generate a Hugging Face dataset from a generator function. Part of the data I want to store consists of long arrays (on the order of 500,000 elements).
Here is a simple example:
from datasets import Dataset
import numpy as np

np.random.seed(1000)

def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

ds = Dataset.from_generator(generator=generator, writer_batch_size=100)
This finishes in about one second. But when I specify the feature of the “data” column, like here
from datasets import Dataset, Value, Features, Sequence
import numpy as np

np.random.seed(1000)

def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

features = {"data": Sequence(Value("float32"))}
ds2 = Dataset.from_generator(generator=generator, writer_batch_size=100, features=Features(features))
it takes 20 seconds, even though the automatically constructed feature in the first case is the same as the specified one in the second case.
Why is it slower to specify the feature in this case? And is there a way to get the speed of the first case while specifying the feature?
In the first case, where you don't specify the features, the datasets library infers an optimized feature structure directly from the data, which is fast to process. In the second case, where you explicitly specify the feature structure, the library performs additional processing to encode and validate each example against your specified structure, and this extra work slows down dataset creation.
To get the speed of the first case while specifying the feature, you can create a custom feature using Value without wrapping it in a Sequence. This way, you specify the data type directly without adding an extra level of sequence. Here’s how you can do it:
from datasets import Dataset, Value, Features
import numpy as np

np.random.seed(1000)

def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

# Specify the feature directly without Sequence
features = {"data": Value("float32")}
ds2 = Dataset.from_generator(generator=generator, writer_batch_size=100, features=Features(features))
By specifying the feature this way, you should see improved speed similar to the first case, as you’re avoiding the additional overhead introduced by the unnecessary Sequence wrapping.
TypeError: only size-1 arrays can be converted to Python scalars
If I understood the features correctly, specifying features = {"data": Value("float32")} makes the library expect "data" to be a single float32 number instead of an array of float32 numbers.
The error indicates that Value("float32") makes the library try to cast the entire NumPy array to a single scalar, which fails.
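The error itself comes from NumPy: a multi-element array cannot be converted to a single Python scalar, which is effectively what a scalar Value("float32") feature asks for each example. A minimal NumPy-only sketch reproducing the failure (the small array size here is just for illustration):

```python
import numpy as np

arr = np.random.rand(5).astype(np.float32)

# Casting a multi-element array to one Python float fails --
# this is what a scalar feature type effectively requires per example.
try:
    float(arr)
except TypeError as err:
    error_message = str(err)

# A genuine scalar converts without trouble.
scalar = float(np.float32(0.5))
```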
Let’s reconsider:
We keep the Sequence since we are dealing with arrays.
There could be an underlying bottleneck when features are specified explicitly, likely in how the library checks and validates the data against the provided features.
Let's try the Sequence feature type with an explicit length for the sequence.
from datasets import Dataset, Value, Features, Sequence
import numpy as np

np.random.seed(1000)

def generator():
    for _ in range(300):
        yield {"data": np.random.rand(100000).astype(np.float32)}

# Specify the sequence length in the feature definition
features = {"data": Sequence(feature=Value("float32"), length=100000)}
ds2 = Dataset.from_generator(generator=generator, writer_batch_size=100, features=Features(features))
If this does not improve the performance significantly, the slowdown most likely comes from an inefficiency in the datasets library itself when features are specified explicitly.
Is there a reason why the ArrayXD feature is only defined for dimensions >= 2? In a fork I implemented an Array1D (mostly copied from Array2D), and when I use that, my problem is solved. But I am not sure whether this would cause problems further down the line after creating the dataset. I could of course just use an Array2D and transform my 1D array into a 2D array with shape (-1, 1). But this is a bit inelegant and would also add overhead, as I always have to transform it back when using it. So I am wondering why there is no native Array1D implementation.
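For reference, the (-1, 1) reshape round-trip mentioned above would look roughly like this (a sketch with a small stand-in array; the flatten on every read is the overhead in question):

```python
import numpy as np

arr = np.random.rand(1000).astype(np.float32)  # stand-in for the 500,000-element array

# Store the 1D array as an (n, 1) column so a 2D feature type accepts it...
arr_2d = arr.reshape(-1, 1)

# ...and flatten it back on every access.
restored = arr_2d.ravel()
```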
Array2D with shape (-1, 1) is also not an option, because then the cast_to_python_object function is very slow: a list comprehension is done over every dimension but the last. In my case the array has shape (500000, 1), which takes very long.
Again, I think this could be solved by a real Array1D implementation. Or is there another solution for handling long 1D arrays efficiently?
If not, I could also prepare a PR which adds the Array1D feature.
We plan to introduce a Tensor type (backed by the recently introduced PyArrow Tensor type) soon to make handling of 1D arrays efficient; storage-wise it's as efficient as Sequence(Value(<dtype>), length=np.prod(<shape>)). In the meantime, generate the dataset using ds = Dataset.from_generator(generator=generator, writer_batch_size=100) and then cast it with ds = ds.cast_column("data", Sequence(Value("float32"), length=100000)) to avoid the slow encode_nested_example code path.