I’ve been trying multiple ways to build a Dataset object from a TSV containing two features: a simple integer label, and a sentence that will be embedded. Creating two tensors and wrapping them in a TensorDataset doesn’t build the necessary feature dict, so currently I’m trying IterableDataset.from_generator().
My generator yields a dict with two items:
yield { "embeddings": tt[i], "label": labels[i] }
My embedding for each sentence is a 2D float64 array of shape (1, 43).
I try to build the Dataset like this:
ds = datasets.IterableDataset.from_generator(gen, features={ "embeddings" : datasets.Array2D(shape=(1, 43), dtype='float64'), "label" : datasets.Value(dtype='int32') } )
I get the error “argument of type ‘Array2D’ is not iterable”. As was suggested somewhere in your documentation I tried to wrap the embeddings array in an ndarray before yielding it, but that didn’t help. How do I go about declaring this feature? Or am I barking up the wrong tree?
I think you need to pass a datasets.Features object instead of a plain dict as the features argument to IterableDataset.from_generator.
import numpy as np
import datasets as ds

# toy data
N, D = 5, 43
tt = [np.random.randn(1, D).astype(np.float64) for _ in range(N)]
labels = np.arange(N, dtype=np.int32)

# features MUST be datasets.Features(...)
features = ds.Features({
    "embeddings": ds.Array2D(shape=(1, D), dtype="float64"),
    "label": ds.Value("int32"),
})

def gen():
    for emb, lab in zip(tt, labels):
        yield {
            "embeddings": np.asarray(emb, dtype=np.float64).reshape(1, D),
            "label": int(lab),
        }

itds = ds.IterableDataset.from_generator(gen, features=features)
example = next(iter(itds))
print(type(example["embeddings"]), np.array(example["embeddings"]).shape)  # <class 'numpy.ndarray'> (1, 43)
Yes, thank you, that worked!
In order to use it for training, though, I had to add input_ids as a third feature in the generator - the direct output of the tokenizer before conversion to float.
I suppose it needed that to match up embedded words with those of the base model I was training.
Peter