Hi everyone,
I joined your efforts today.
Regarding the large disk space consumption I found that after this step:
```python
# Preprocessing the datasets.
# We need to read the audio files as arrays and tokenize the targets.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    batch["sampling_rate"] = 16_000
    batch["target_text"] = batch["text"]
    return batch

train_dataset = train_dataset.map(
    speech_file_to_array_fn,
    remove_columns=train_dataset.column_names,
    num_proc=data_args.preprocessing_num_workers,
)
eval_dataset = eval_dataset.map(
    speech_file_to_array_fn,
    remove_columns=eval_dataset.column_names,
    num_proc=data_args.preprocessing_num_workers,
)
```
although the `batch["speech"]` numpy arrays are float32, the Arrow table (`train_dataset.data`) reports the values as doubles. I suspect the arrays are converted to a Python `List[float]` somewhere before being added to the Arrow table, so the inferred type becomes float64.
As a result, the uncompressed cache is potentially much larger than it needs to be, both compared to the original compressed mp3 files and compared to the same data stored as float32.
Additionally, this step does not run at all for me when using multiple workers.
I'm looking to replace this step by saving raw fp32 tensors to disk and then using a custom dataset during training.
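For reference, a minimal sketch of what I have in mind. The file layout (one `.npy` file per utterance) and the class name are just assumptions for illustration, not anything from the original script:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class RawSpeechDataset(Dataset):
    """Loads pre-saved fp32 speech arrays from disk.

    Assumes each utterance was saved beforehand with np.save as a
    float32 array, so nothing is ever widened to float64.
    """

    def __init__(self, npy_paths, targets):
        self.npy_paths = npy_paths  # one .npy file per utterance
        self.targets = targets      # matching transcription strings

    def __len__(self):
        return len(self.npy_paths)

    def __getitem__(self, idx):
        # np.load returns the array with its on-disk dtype (float32 here)
        speech = np.load(self.npy_paths[idx])
        return {
            "speech": torch.from_numpy(speech),  # stays float32
            "target_text": self.targets[idx],
        }
```

The upside is that the on-disk format is exactly the fp32 tensors, with no Arrow type inference in the loop; the downside is giving up the caching and `num_proc` machinery of `datasets.map`.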