Hello!
I already read this issue: Can't create a dataset with `float16` features · Issue #4981 · huggingface/datasets · GitHub. So I know that float16 is not supported by pyarrow. However, the possibility to use float16 for audio after passing it through a feature extractor would be very nice. Wav uses 16/24 bits to represent each sample, so using float32 doesn't bring better precision, leads to slower access, and makes the cache occupy 2x more memory than the original dataset. It seems more efficient to just use `example['audio']['array']`
inside a collate function and perform padding, truncation and normalization there, for example as in the sketch below.
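A minimal sketch of what I mean by doing the work in the collate function (the `collate_fn` name and the fixed target length of 16000 samples are just illustrative assumptions):

```python
import numpy as np

TARGET_LEN = 16000  # 1 second at 16 kHz, for illustration only

def collate_fn(batch):
    arrays = []
    for example in batch:
        audio = example["audio"]["array"]
        # pad or truncate to a fixed length
        if audio.shape[0] < TARGET_LEN:
            audio = np.pad(audio, (0, TARGET_LEN - audio.shape[0]))
        else:
            audio = audio[:TARGET_LEN]
        # zero-mean / unit-variance normalization
        audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
        arrays.append(audio)
    return np.stack(arrays).astype("float32")
```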
Here is a chunk of code to show the issue:
```python
from datasets import load_dataset
import time
import numpy as np
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=False
    )
    return inputs

dataset = load_dataset("speech_commands", 'v0.02', split="validation")
encoded_dataset = dataset.map(preprocess_function, remove_columns="audio", batched=True)

indices = list(range(len(dataset)))
np.random.shuffle(indices)

tic = time.perf_counter()
for idx in indices:
    audio = dataset[idx]["audio"]["array"]
    # Every file is exactly 16000 samples, so I made it so that it's
    # padded and truncated every time to simulate the worst case.
    # I know this could be implemented much more efficiently.
    if audio.shape[0] < 16100:
        audio = np.pad(audio, (0, 100))
    if audio.shape[0] > 16000:
        audio = audio[:16000]
    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
toc = time.perf_counter()
print(f"Iteration over dataset + padding + truncation + normalization took {toc - tic:0.4f} seconds")

tic = time.perf_counter()
for idx in indices:
    audio = encoded_dataset[idx]["input_values"]
toc = time.perf_counter()
print(f"Iteration over encoded dataset took {toc - tic:0.4f} seconds")
```
If I understand correctly, this feature extractor does essentially what I do in the loop above.
As a result I got:

```
Iteration over dataset + padding + truncation + normalization took 5.1508 seconds
Iteration over encoded dataset took 73.4492 seconds
```
Moreover, the original validation set is 300 MB, while the cache file is twice as big due to the float32 representation of the features.
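For reference, a quick way to compare the on-disk sizes (a sketch; the exact paths in `cache_files` depend on the local cache layout):

```python
import os

def total_size_mb(ds):
    # sum the sizes of the Arrow cache files backing the dataset
    return sum(os.path.getsize(f["filename"]) for f in ds.cache_files) / 1e6

print(f"original: {total_size_mb(dataset):.1f} MB")
print(f"encoded:  {total_size_mb(encoded_dataset):.1f} MB")
```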
If this were possible:

```python
features = Features({'data': Sequence(feature=Value(dtype='float16'))})
dataset_audio = dataset.map(
    lambda example: {"data": example["audio"]["array"].astype("float16")},
    remove_columns="audio",
    features=features,
)
```

it would be lovely.
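In the meantime, a workaround I am considering (just a sketch, not benchmarked, and assuming the decoded array is normalized to [-1, 1]): since the source wavs are 16-bit anyway, the samples could be stored as int16, which pyarrow does support, and rescaled to float in the collate function:

```python
from datasets import Features, Sequence, Value
import numpy as np

int16_features = Features({"data": Sequence(feature=Value(dtype="int16"))})

def to_int16(example):
    # the decoded array is float in [-1, 1]; rescale to the raw 16-bit range
    return {"data": (example["audio"]["array"] * 32767).astype("int16")}

dataset_int16 = dataset.map(to_int16, remove_columns="audio", features=int16_features)

# later, e.g. in the collate function:
# audio = np.asarray(example["data"], dtype=np.float32) / 32767
```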
I ran this on Windows with an HDD. I read that there are issues with virtual memory mapping on Windows, so I also ran it in a Kaggle notebook and got similar results. However, I may be doing something wrong, so please point it out if so.
```
datasets==2.9.0
pyarrow==11.0.0
transformers==4.26.1
```