Not supporting float16 for Audio Datasets slows things down

Hello!
I have already read this issue: Can't create a dataset with `float16` features · Issue #4981 · huggingface/datasets · GitHub, so I know that float16 is not supported by pyarrow. Still, being able to use float16 for audio after passing it through a feature extractor would be very nice. WAV uses 16 or 24 bits to represent each sample, so using float32 doesn't bring better precision; it leads to slower access, and the cache occupies 2x more memory than the original dataset. It seems more efficient to just use example['audio']['array'] inside a collate function and perform padding, truncation and normalization there.
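For illustration, the collate-function idea could look roughly like the sketch below. The name `collate_audio`, its parameters, and the synthetic clips are all hypothetical; it normalizes each clip before padding, which is an assumption about the desired behaviour and not necessarily what the feature extractor does internally.

```python
import numpy as np


def collate_audio(examples, max_length=16000, eps=1e-7):
    """Hypothetical collate function: normalize, then pad or truncate each clip."""
    # Pre-filled with zeros, so short clips end up zero-padded on the right
    batch = np.zeros((len(examples), max_length), dtype=np.float32)
    for i, example in enumerate(examples):
        audio = np.asarray(example["audio"]["array"], dtype=np.float32)
        # Zero-mean, unit-variance normalization per clip
        audio = (audio - audio.mean()) / np.sqrt(audio.var() + eps)
        # Truncate clips longer than max_length
        length = min(len(audio), max_length)
        batch[i, :length] = audio[:length]
    return {"input_values": batch}


# Synthetic clips standing in for dataset rows (not real data)
rng = np.random.default_rng(0)
examples = [
    {"audio": {"array": rng.standard_normal(n), "sampling_rate": 16000}}
    for n in (12000, 16000, 20000)
]
batch = collate_audio(examples)
print(batch["input_values"].shape)
```

A function like this could be passed as `collate_fn` to a PyTorch `DataLoader` after converting the array to a tensor.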

Here is a chunk of code that shows the issue:

from datasets import load_dataset
import time
import numpy as np
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=False
    )
    return inputs


dataset = load_dataset("speech_commands", 'v0.02', split="validation")

encoded_dataset = dataset.map(preprocess_function, remove_columns="audio", batched=True)

indices = list(range(len(dataset)))
np.random.shuffle(indices)

tic = time.perf_counter()
for idx in indices:
    audio = dataset[idx]["audio"]["array"]
    # Every file is exactly 16000 samples, so I pad and truncate every time
    # to simulate the worst case.
    # I know this could be implemented much more efficiently.
    if audio.shape[0] < 16100:
        audio = np.pad(audio, (0,100))
    if audio.shape[0] > 16000:
        audio = audio[:16000]
    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
toc = time.perf_counter()

print(f"Iteration over dataset + padding + truncation + normalization took {toc - tic:0.4f} seconds")

tic = time.perf_counter()
for idx in indices:
    audio = encoded_dataset[idx]["input_values"]
toc = time.perf_counter()

print(f"Iteration over encoded dataset took {toc - tic:0.4f} seconds")

If I understand correctly, the feature extractor does exactly what I do in the first loop above.

As a result I got:

Iteration over dataset + padding + truncation + normalization took 5.1508 seconds
Iteration over encoded dataset took 73.4492 seconds

Moreover, the original validation set is 300MB and the cache file is twice as big due to the float32 representation of the features.
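The 2x factor follows directly from the sample width. A quick check with NumPy, assuming one-second clips at 16 kHz and 16-bit PCM in the source files:

```python
import numpy as np

num_samples = 16000  # one second at 16 kHz

# Bytes needed to hold one clip at each sample width
sizes = {
    np.dtype(dtype).name: np.zeros(num_samples, dtype=dtype).nbytes
    for dtype in (np.int16, np.float16, np.float32)
}

for name, nbytes in sizes.items():
    print(f"{name:>8}: {nbytes} bytes per clip")
```

Here float32 takes twice the space of int16, while float16 matches it.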

If this were possible:

from datasets import Features, Sequence, Value

features = Features({"data": Sequence(feature=Value(dtype="float16"))})
dataset_audio = dataset.map(
    lambda example: {"data": example["audio"]["array"].astype("float16")},
    remove_columns="audio",
    features=features,
)

it would be lovely.

I ran this on Windows with an HDD. I read that there are issues with virtual memory mapping on Windows, so I also ran it in a Kaggle notebook and got similar results. However, I may be doing something wrong, so please point it out if so.

datasets==2.9.0
pyarrow==11.0.0
transformers==4.26.1

Hi! You are comparing iterating over:
1 - a dataset with an Audio() feature type, containing encoded WAV data and decoded as numpy arrays
2 - a dataset with a Sequence(Value("float32")) feature type, containing float32 values and decoded as python lists

If 2 is much slower, I'd bet it's because the data is decoded as python lists and not numpy arrays.
You can try setting the output to be numpy arrays and compare again:

encoded_dataset = encoded_dataset.with_format("numpy")

This way it doesn't need to copy the data to create slow python lists, and the numpy array data is memory mapped from the data on disk.

Let me know if that helps

@lhoestq Thank you for your response, this seems obvious now. It did indeed help; now it looks like this:

Iteration over dataset + padding + truncation + normalization took 5.7007 seconds
Iteration over encoded dataset took 3.2401 seconds

Still, there is a small issue: the cache is twice as big as the dataset, and the extra bytes don't carry much new information. Most of the time this won't be a problem, but with a big audio dataset it may become one. Are there any rumours if float16 will be supported by pyarrow soon? Do you think saving audio as a float16 arrow file would be advantageous?

Is it possible to set this decoding to numpy or pytorch to be default one every time I load a dataset?

I guess you can try storing the transformed audio using the Audio() type - it will encode the data as WAV and save disk space.

To do so you can pass features=... to .map() to specify the feature types of the output dataset

Are there any rumours if float16 will be supported by pyarrow soon?

pyarrow.float16 does exist, but doesn't seem to be fully implemented and causes casting errors

Is it possible to set this decoding to numpy or pytorch to be default one every time I load a dataset?

No, you have to specify it every time.