Preprocessing & augmentation during training


I am currently fine-tuning a Wav2Vec2 model and I would like to know how I could modify training samples.

In TensorFlow it’s possible to do this on the fly by mapping a function over the dataset, but doing the same with a datasets.arrow_dataset.Dataset via `map` will write a cache file. Besides that, the result is static (the function is applied only once) and the cache file is unnecessary.

So, this here is not an option:

train_dataset = train_dataset.map(...)  # applied once; result written to a cache file

There is the data_collator, and one could do it there. However, this is likely not very efficient, and I would have to take care of parallelization myself.
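For reference, a rough sketch of what that collator approach could look like (pure Python; `augment` and the simple zero-padding are hypothetical stand-ins, not the actual Wav2Vec2 collator):

```python
import random

def augment(waveform, rng):
    # Hypothetical augmentation: apply a random gain between 0.5x and 1.5x.
    gain = rng.uniform(0.5, 1.5)
    return [sample * gain for sample in waveform]

class AugmentingCollator:
    """Collates raw waveforms into a padded batch, augmenting on the fly.

    Because this runs inside the training loop, every epoch sees a
    different random augmentation of the same underlying samples.
    """
    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def __call__(self, features):
        waveforms = [augment(f["input_values"], self.rng) for f in features]
        max_len = max(len(w) for w in waveforms)
        # Zero-pad every waveform to the longest one in the batch.
        padded = [w + [0.0] * (max_len - len(w)) for w in waveforms]
        return {"input_values": padded}

collator = AugmentingCollator(seed=0)
batch = collator([{"input_values": [0.1, 0.2]}, {"input_values": [0.3]}])
```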

I’m looking for a similar way to process input examples on the fly.

Hi! I think `Dataset.set_transform` is what you are looking for.

Yes, this would be something I’d like to do. The problem is that I have a datasets.arrow_dataset.Dataset:

>>> print(type(train_dataset))
<class 'datasets.arrow_dataset.Dataset'>

which is what I get after calling `builder_instance.as_dataset(..)`, following this description: Writing a dataset loading script — datasets 1.11.0 documentation

I have tried to call:


giving me

TypeError: Expected a pyarrow.Table or a `datasets.table.Table` object




AttributeError: 'Dataset' has no attribute 'schema'

I guess my question is: how can I get a datasets.Dataset from a datasets.arrow_dataset.Dataset?

I figured that I can do this:

train_dataset = datasets.Dataset(

It’s a bit weird, though, and I’m not sure how efficient it is, e.g. if I had to do more expensive computations.