Preprocessing & augmentation during training

Hi!

I am currently fine-tuning a Wav2Vec2 model and I would like to know how I could modify training samples.

In TensorFlow this can be done by calling dataset.map(), but doing the same with a datasets.arrow_dataset.Dataset writes a cache file to disk. Besides that, the result is static: the augmentation is applied once instead of freshly on every epoch, which defeats the purpose.

So, this here is not an option:

train_dataset = train_dataset.map(preprocess_and_augment)

The data_collator would be another place to do it. However, that is likely inefficient, and I would have to take care of parallelization myself.
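For reference, doing it in the collator would look roughly like this (a minimal sketch; augment, the feature name, and the values are made up for illustration):

```python
import random

def augment(waveform):
    # Hypothetical augmentation: apply a random gain to the raw audio.
    gain = random.uniform(0.8, 1.2)
    return [x * gain for x in waveform]

def data_collator(features):
    # Runs once per batch on the main process -- which is exactly why
    # heavy augmentation here can become a bottleneck.
    return {"input_values": [augment(f["input_values"]) for f in features]}

features = [{"input_values": [0.1, 0.2]}, {"input_values": [0.3, 0.4]}]
batch = data_collator(features)
print(len(batch["input_values"]))  # 2
```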

I’m looking for a similar way to process input examples like Dataset.map().

Hi! I think Dataset.set_transform is what you are looking for.

Yes, this would be something I’d like to do. The problem is that what I have is a datasets.arrow_dataset.Dataset:

print(type(train_dataset))
<class 'datasets.arrow_dataset.Dataset'>

which is what I get after calling builder_instance.as_dataset(..) following this description: Writing a dataset loading script — datasets 1.11.0 documentation

I have tried to call:

datasets.Dataset(train_dataset)

giving me

TypeError: Expected a pyarrow.Table or a `datasets.table.Table` object

and

datasets.table.Table(train_dataset)

throws

AttributeError: 'Dataset' has no attribute 'schema'

I guess my question is: How can I get a datasets.Dataset from a datasets.arrow_dataset.Dataset?

I figured that I can do this:

train_dataset = datasets.Dataset(train_dataset.data)
train_dataset.set_transform(example_adapter.preprocess)

It feels a bit hacky, though, and I am not sure how efficient it is, e.g. if I had to do more expensive computations.