Preprocessing & augmentation during training

Hi!

I am currently fine-tuning a Wav2Vec2 model and I would like to know how I could modify training samples.

In TensorFlow this can be done by calling dataset.map(), but doing the same with a datasets.arrow_dataset.Dataset writes a cache file to disk. Besides that, the result is static: the augmentation is applied once instead of freshly on every epoch, which defeats the purpose.

So, this here is not an option:

train_dataset = train_dataset.map(preprocess_and_augment)

The data_collator would be another place to do it. However, that is likely inefficient, and I would have to take care of parallelization myself.
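For reference, doing it in the collator would look roughly like this (a minimal sketch; augment, the feature name, and the values are made up for illustration):

```python
import random

def augment(waveform):
    # Hypothetical augmentation: apply a random gain to the raw audio.
    gain = random.uniform(0.8, 1.2)
    return [x * gain for x in waveform]

def data_collator(features):
    # Runs once per batch on the main process -- which is exactly why
    # heavy augmentation here can become a bottleneck.
    return {"input_values": [augment(f["input_values"]) for f in features]}

features = [{"input_values": [0.1, 0.2]}, {"input_values": [0.3, 0.4]}]
batch = data_collator(features)
print(len(batch["input_values"]))  # 2
```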

I’m looking for a similar way to process input examples like Dataset.map().

Hi! I think Dataset.set_transform is what you are looking for.

Yes, this would be something I’d like to do. The problem is that what I have is a datasets.arrow_dataset.Dataset:

print(type(train_dataset))
<class 'datasets.arrow_dataset.Dataset'>

which is what I get after calling builder_instance.as_dataset(..) following this description: Writing a dataset loading script — datasets 1.11.0 documentation

I have tried to call:

datasets.Dataset(train_dataset)

giving me

TypeError: Expected a pyarrow.Table or a `datasets.table.Table` object

and

datasets.table.Table(train_dataset)

throws

AttributeError: 'Dataset' has no attribute 'schema'

I guess my question is: How can I get a datasets.Dataset from a datasets.arrow_dataset.Dataset?

I figured that I can do this:

train_dataset = datasets.Dataset(train_dataset.data)
train_dataset.set_transform(example_adapter.preprocess)

It feels a bit hacky, though, and I am not sure how efficient it is, e.g. if I had to do more expensive computations.