Using datasets.from_generator() with a column that is an embedding

I’ve been trying multiple ways to build a Dataset object from a TSV containing two features: a simple integer label, and a sentence that will be embedded. Creating two tensors and wrapping them in a TensorDataset doesn’t build the necessary feature dict, so currently I’m trying IterableDataset.from_generator().

My generator yields a dict with two entries:

yield { "embeddings": tt[i], "label": labels[i] }

My embedding for each sentence is a 2D array of shape (1, 43) with dtype float64.

I try to build the Dataset like this:

ds = datasets.IterableDataset.from_generator(
    gen,
    features={
        "embeddings": datasets.Array2D(shape=(1, 43), dtype='float64'),
        "label": datasets.Value(dtype='int32'),
    },
)

I get the error “argument of type ‘Array2D’ is not iterable”. As suggested somewhere in your documentation, I tried wrapping the embeddings array in an ndarray before yielding it, but that didn’t help. How do I go about declaring this feature? Or am I barking up the wrong tree?


I think you need to pass a datasets.Features object instead of a plain dict to IterableDataset.from_generator:

import numpy as np
import datasets as ds

# toy data
N, D = 5, 43
tt = [np.random.randn(1, D).astype(np.float64) for _ in range(N)]
labels = np.arange(N, dtype=np.int32)

# features MUST be datasets.Features(...)
features = ds.Features({
    "embeddings": ds.Array2D(shape=(1, D), dtype="float64"),
    "label": ds.Value("int32"),
})

def gen():
    for emb, lab in zip(tt, labels):
        yield {
            "embeddings": np.asarray(emb, dtype=np.float64).reshape(1, D),
            "label": int(lab),
        }

itds = ds.IterableDataset.from_generator(gen, features=features)
example = next(iter(itds))
print(np.array(example["embeddings"]).shape, example["label"])  # (1, 43) 0
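If you need a map-style dataset instead (for random access, or the usual Trainer path), the same generator and features also work with the non-streaming builder. A minimal sketch, reusing gen and features from above:

regular = ds.Dataset.from_generator(gen, features=features)
print(len(regular), np.array(regular[0]["embeddings"]).shape)  # 5 (1, 43)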

Yes, thank you, that worked!

In order to use it for training, though, I had to add input_ids as a third feature in the generator: the direct output of the tokenizer, before conversion to float.
I suppose the trainer needed that to match the embedded words up with those of the base model I was training.
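Roughly, reusing the names from the snippet above, the generator becomes something like this (the checkpoint name and the sentences list are placeholders, not the actual values from my setup):

from transformers import AutoTokenizer

# hypothetical checkpoint; substitute the base model actually being trained
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = ["first toy sentence", "second toy sentence"]  # stand-ins for the TSV column

def gen_with_ids():
    for sent, emb, lab in zip(sentences, tt, labels):
        enc = tok(sent, truncation=True)
        yield {
            "embeddings": np.asarray(emb, dtype=np.float64).reshape(1, D),
            "input_ids": enc["input_ids"],
            "label": int(lab),
        }

features_with_ids = ds.Features({
    "embeddings": ds.Array2D(shape=(1, D), dtype="float64"),
    "input_ids": ds.Sequence(ds.Value("int32")),
    "label": ds.Value("int32"),
})

itds = ds.IterableDataset.from_generator(gen_with_ids, features=features_with_ids)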

Peter
