Using datasets.from_generator() with a column that is an embedding

Yes, thank you, that worked!

In order to use it for training, though, I had to add input_ids as a third feature in the generator - the direct output of the tokenizer before conversion to float.
I suppose it needed that to match up embedded words with those of the base model I was training.

Peter

1 Like