Expanding an Audio Dataset with datasets.map()?

scasutt · March 16, 2022, 9:17am

Hello,
After a couple of hours trying to get this to work, I need to ask.

I’m trying to expand a dataset (used with Wav2Vec2 for ASR) with the following idea:

dataset_expansion = dataset

# adding simple noise to dataset expansion
def add_simple_noise(batch):
    audio = batch["audio_path"]
    noise = np.asarray(0.01 * np.random.randn(len(audio["array"])))
    audio["array"] = audio["array"] + noise
    return batch

# map simple noise to training set
dataset_expansion = dataset_expansion.map(add_simple_noise)
dataset_expansion = dataset_expansion.cast_column("audio_path", datasets.Audio(sampling_rate=16_000))

The mapping itself seems to work, and noise is added.

But the result of the mapping does not seem to be correct.

Concatenating the two datasets:
dataset = datasets.concatenate_datasets([dataset["train"], dataset_expansion["train"]])

throws the following error:
ArrowInvalid: Schema at index 1 was different:
audio_path: string
text: string
sampling_rate: int64
train_or_test: string
vs
audio_path: struct<array: list<item: double>, path: string, sampling_rate: int64>
text: string
sampling_rate: int64
train_or_test: string
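
A side note on the second schema: the `list<item: double>` entry suggests the noisy array was promoted to float64, since `np.random.randn` returns float64 while the decoded audio is float32 (see the sample output further down). The sketch below is not the code from this post; it is a minimal variant of the noise step that keeps the input dtype, assuming floating-point samples. The name `add_gaussian_noise` and the `scale` and `seed` parameters are hypothetical. It does not by itself resolve the string vs. struct mismatch shown above.

```python
import numpy as np


def add_gaussian_noise(samples, scale=0.01, seed=None):
    """Return a noisy copy of `samples` with the same dtype and shape.

    Hypothetical helper for illustration; assumes a 1-D array of
    floating-point audio samples.
    """
    samples = np.asarray(samples)
    rng = np.random.default_rng(seed)
    # standard_normal returns float64, so the sum below is float64 ...
    noise = scale * rng.standard_normal(samples.shape[0])
    # ... and is cast back to the dtype of the input (e.g. float32).
    return (samples + noise).astype(samples.dtype)


clean = np.zeros(16_000, dtype=np.float32)
noisy = add_gaussian_noise(clean, scale=0.01, seed=0)
print(noisy.dtype, noisy.shape)
```

Inside a `map` function this would replace the line that computes `audio["array"] + noise`.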

Checking the features:
dataset["train"].features
{'audio_path': Audio(sampling_rate=16000, mono=True, id=None),
 'text': Value(dtype='string', id=None),
 'sampling_rate': Value(dtype='int64', id=None),
 'train_or_test': Value(dtype='string', id=None)}

dataset_expansion["train"].features
{'audio_path': Audio(sampling_rate=16000, mono=True, id=None),
 'text': Value(dtype='string', id=None),
 'sampling_rate': Value(dtype='int64', id=None),
 'train_or_test': Value(dtype='string', id=None)}

The dataset was loaded as follows:
feature_dict = {"audio_path": datasets.Audio(sampling_rate=16_000), "text": datasets.Value("string")}
data_features = datasets.Features(feature_dict)

dataset = load_dataset(
    "csv",
    data_files={"train": "toy_train_data.csv", "test": "toy_test_data.csv"},
)
dataset = dataset.cast_column("audio_path", datasets.Audio(sampling_rate=16_000, mono=True))
dataset = dataset.remove_columns("Unnamed: 0")

I might just be missing something really small, but I can’t seem to find whatever needs to be done to get this to work.

The alternative to doing this on the fly would be to make a copy of the data and add noise there.
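
For reference, the offline alternative could look roughly like the sketch below: read each WAV file, add Gaussian noise, and write a noisy copy next to it, so the expanded set can be loaded from plain file paths like the original. This is a minimal sketch under assumptions, not the code actually used here: it assumes 16-bit PCM WAV files, and the function name `write_noisy_copy` and its parameters are hypothetical.

```python
import wave

import numpy as np


def write_noisy_copy(src_path, dst_path, scale=0.01, seed=None):
    """Write a copy of a 16-bit PCM WAV file with Gaussian noise added.

    Hypothetical helper for illustration only.
    """
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        frames = src.readframes(params.nframes)

    # Convert int16 samples to floats in [-1, 1).
    samples = np.frombuffer(frames, dtype="<i2").astype(np.float32) / 32768.0

    rng = np.random.default_rng(seed)
    noisy = samples + scale * rng.standard_normal(samples.shape[0])

    # Clip to the valid range before converting back to int16.
    pcm = (np.clip(noisy, -1.0, 1.0) * 32767.0).astype("<i2")

    with wave.open(str(dst_path), "wb") as dst:
        dst.setparams(params)  # same channels, sample width, and frame rate
        dst.writeframes(pcm.tobytes())
```

The noisy files can then be listed in a second CSV and loaded the same way as the original data, which keeps the `audio_path` column a plain string in both datasets.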

If anyone could point me in the right direction, I’d really appreciate it.

Thank you and have a great day.

scasutt · March 16, 2022, 10:02am

Additional information:

dataset["train"][0]["audio_path"]
{'path': './audio/ch_ag_0006.wav',
 'array': array([-3.0517578e-05, -3.0517578e-05, -3.0517578e-05, ...,
        -1.2207031e-04, -9.1552734e-05, 0.0000000e+00], dtype=float32),
 'sampling_rate': 16000}

dataset_expansion["train"][0]["audio_path"]

AttributeError: 'dict' object has no attribute 'endswith'

mariosasko · March 21, 2022, 12:00pm

Hi! Which version of datasets are you using? I’m pretty sure this issue can be resolved by using the newest version of datasets, which you can install as follows:

pip install -U datasets

Let me know if that doesn’t help.

scasutt · March 21, 2022, 1:08pm

Hello Mario
Thank you for your time.
That was one of my assumptions at first, as the original version ran on datasets 1.17 (and didn’t work as expected). This version of the script is running on datasets 2.0.0 (and transformers 4.17.0).

In the meantime I’m doing the audio expansion locally and uploading the expanded audio to run my tests. Not quite as elegant as I would like, but it works.

Cheers

Stefan

ai-nikolai · December 5, 2024, 9:39am

Any luck with this one?
