Using datasets.from_generator() with a column that is an embedding

pabramowitsch · August 20, 2025, 5:42am

Yes, thank you, that worked!

To use it for training, though, I had to add input_ids as a third feature in the generator: the direct output of the tokenizer, before conversion to float.
I suppose it needed that to match up the embedded tokens with those of the base model I was training.
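
For anyone following along, here is a minimal sketch of a generator that yields input_ids next to the embedding column. The column names "text" and "embedding", the toy_tokenize and toy_embed helpers, and EMBED_DIM are all assumptions made up for illustration; the real tokenizer, embedding step, and feature layout from this thread are not shown here. Only Dataset.from_generator, Features, Sequence, and Value come from the datasets library.

```python
EMBED_DIM = 4  # hypothetical embedding width, for illustration only


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one integer id per word."""
    return [sum(ord(c) for c in word) % 1000 for word in text.split()]


def toy_embed(input_ids):
    """Hypothetical stand-in for the embedding step: one float vector per token id."""
    return [[(tid + k) / 1000.0 for k in range(EMBED_DIM)] for tid in input_ids]


def gen(texts):
    """Yield one example per text, keeping the raw token ids beside the floats."""
    for text in texts:
        input_ids = toy_tokenize(text)  # integer ids, before any float conversion
        yield {
            "text": text,
            "embedding": toy_embed(input_ids),
            "input_ids": input_ids,  # the extra third feature
        }


def build_dataset(texts):
    """Build a Dataset from the generator with explicit feature types."""
    # Imported here so the generator above can be tried without the library installed.
    from datasets import Dataset, Features, Sequence, Value

    features = Features(
        {
            "text": Value("string"),
            # Variable number of tokens, each a fixed-length float vector.
            "embedding": Sequence(Sequence(Value("float32"), length=EMBED_DIM)),
            "input_ids": Sequence(Value("int64")),
        }
    )
    return Dataset.from_generator(gen, features=features, gen_kwargs={"texts": texts})
```

Calling build_dataset(["hello world", "another example"]) should then give a dataset with all three columns. Whether a given trainer actually requires input_ids depends on the model and collator in use, so treat this as one possible layout rather than a required one.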

Peter
