Using datasets.from_generator() with a column that is an embedding

I’ve been trying multiple ways to build a Dataset object from a TSV containing two features: a simple integer label, and a sentence that will be embedded. Creating two tensors and wrapping them in a TensorDataset doesn’t build the necessary feature dict, so currently I’m trying IterableDataset.from_generator().

My generator yields a dict with two fields:

yield { "embeddings": tt[i], "label": labels[i] }

My embedding for each sentence is a 2D float64 array of shape (1, 43).

I try to build the Dataset like this:

ds = datasets.IterableDataset.from_generator(gen, features={ "embeddings" : datasets.Array2D(shape=(1, 43), dtype='float64'), "label" : datasets.Value(dtype='int32') } )

I get the error “argument of type ‘Array2D’ is not iterable”. As suggested somewhere in your documentation, I tried wrapping the embeddings array in an ndarray before yielding it, but that didn’t help. How do I go about declaring this feature? Or am I barking up the wrong tree?