I am currently fine-tuning a Wav2Vec2 model and I would like to know how I could modify training samples.
In TensorFlow it’s possible to do this by calling `dataset.map()`, but doing the same with a `datasets.arrow_dataset.Dataset` will write a cache file. Besides that, the result is static, and the cache file is unnecessary.
So, this here is not an option:

```python
train_dataset = train_dataset.map(preprocess_and_augment)
```
There is the `data_collator`, and one could do the augmentation there. However, this is very likely not efficient, and I would have to take care of parallelization myself.
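For what it’s worth, the collator route could look roughly like this. This is only a sketch with made-up field names (`input_values`, `labels`) and a placeholder augmentation; a real Wav2Vec2 collator would also handle padding and tensor conversion:

```python
from typing import Any, Dict, List

def augment_waveform(values: List[float]) -> List[float]:
    # Placeholder augmentation: a simple gain change.
    return [v * 0.5 for v in values]

def collate_with_augmentation(features: List[Dict[str, Any]]) -> Dict[str, Any]:
    # The augmentation runs here once per batch, inside the DataLoader
    # worker, so every epoch sees freshly augmented samples -- but
    # throughput is limited by the number of DataLoader workers.
    return {
        "input_values": [augment_waveform(f["input_values"]) for f in features],
        "labels": [f["labels"] for f in features],
    }

batch = collate_with_augmentation(
    [{"input_values": [1.0, 2.0], "labels": [5]}]
)
```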
I’m looking for a similar way to process input examples like
Hi! I think `Dataset.set_transform` is what you are looking for.
Yes, this would be something I’d like to do. The problem is the kind of object I get back after calling `builder_instance.as_dataset(..)`, following this description: Writing a dataset loading script — datasets 1.11.0 documentation
I have tried a couple of calls and got:

```
TypeError: Expected a pyarrow.Table or a datasets.table.Table object
AttributeError: 'Dataset' has no attribute 'schema'
```
I guess my question is: how can I get a `datasets.Dataset` from what `builder_instance.as_dataset(..)` returns?
I figured that I can do this:

```python
train_dataset = datasets.Dataset(train_dataset.data)
```
It’s a bit weird, and I’m not sure how efficient it is, e.g. if I had to do more expensive computations.