Using datasets.from_generator() with a column that is an embedding

I’ve been trying multiple ways to build a Dataset object from a TSV containing two features: a simple integer label, and a sentence that will be embedded. Creating two tensors and wrapping them in a TensorDataset doesn’t build the necessary feature dict, so currently I’m trying IterableDataset.from_generator().

My generator yields a dict with two entries:

yield { "embeddings": tt[i], "label": labels[i] }

My embedding for each sentence is a 2D array of shape (1, 43) with dtype float64.

I try to build the Dataset like this:

ds = datasets.IterableDataset.from_generator(
    gen,
    features={
        "embeddings": datasets.Array2D(shape=(1, 43), dtype='float64'),
        "label": datasets.Value(dtype='int32'),
    },
)

I get the error “argument of type ‘Array2D’ is not iterable”. As suggested somewhere in your documentation, I tried wrapping the embeddings array in an ndarray before yielding it, but that didn’t help. How do I go about declaring this feature? Or am I barking up the wrong tree?


I think you need to pass a datasets.Features object instead of a plain dict to IterableDataset.from_generator:

import numpy as np
import datasets as ds

# toy data
N, D = 5, 43
tt = [np.random.randn(1, D).astype(np.float64) for _ in range(N)]
labels = np.arange(N, dtype=np.int32)

# features MUST be datasets.Features(...)
features = ds.Features({
    "embeddings": ds.Array2D(shape=(1, D), dtype="float64"),
    "label": ds.Value("int32"),
})

def gen():
    for emb, lab in zip(tt, labels):
        yield {
            "embeddings": np.asarray(emb, dtype=np.float64).reshape(1, D),
            "label": int(lab),
        }

itds = ds.IterableDataset.from_generator(gen, features=features)
example = next(iter(itds))
print(np.array(example["embeddings"]).shape, example["label"])  # (1, 43) 0
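If you need a map-style dataset instead (for random access, or the usual Trainer path), the same generator and features also work with the non-streaming builder. A minimal sketch, reusing gen and features from above:

regular = ds.Dataset.from_generator(gen, features=features)
print(len(regular), np.array(regular[0]["embeddings"]).shape)  # 5 (1, 43)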

Yes, thank you, that worked!

In order to use it for training, though, I had to add input_ids as a third feature in the generator: the direct output of the tokenizer, before conversion to float.
I suppose the trainer needed that to match the embedded words up with those of the base model I was training.
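Roughly, reusing the names from the snippet above, the generator becomes something like this (the checkpoint name and the sentences list are placeholders, not the actual values from my setup):

from transformers import AutoTokenizer

# hypothetical checkpoint; substitute the base model actually being trained
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = ["first toy sentence", "second toy sentence"]  # stand-ins for the TSV column

def gen_with_ids():
    for sent, emb, lab in zip(sentences, tt, labels):
        enc = tok(sent, truncation=True)
        yield {
            "embeddings": np.asarray(emb, dtype=np.float64).reshape(1, D),
            "input_ids": enc["input_ids"],
            "label": int(lab),
        }

features_with_ids = ds.Features({
    "embeddings": ds.Array2D(shape=(1, D), dtype="float64"),
    "input_ids": ds.Sequence(ds.Value("int32")),
    "label": ds.Value("int32"),
})

itds = ds.IterableDataset.from_generator(gen_with_ids, features=features_with_ids)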

Peter
