I am using the following code to send a batch of inputs to the automatic-speech-recognition pipeline:
from transformers import pipeline
from datasets import load_dataset
import numpy as np

ds = load_dataset(
    "hf-internal-testing/librispeech_asr_demo",
    "clean",
    split="validation")
input_data = ds[0]["audio"]["array"]

# Stack seven copies of the same clip into a single 2-D array
batch_test = np.vstack((input_data, input_data))
for i in range(5):
    batch_test = np.vstack((batch_test, input_data))

task = "automatic-speech-recognition"
model_name = "facebook/s2t-small-librispeech-asr"
batch_size = 5
model = pipeline(
    task=task,
    model=model_name,
    batch_size=batch_size)
res = model(batch_test)
res
However, I am receiving the following error, which suggests that the Hugging Face pipeline cannot accept stacked audio inputs and instead treats the rows as multiple channels:
ValueError: We expect a single channel audio input for AutomaticSpeechRecognitionPipeline
Looking at the Hugging Face code, it seems the following line is raising the mentioned error. I couldn't find anything related to preprocessing batched input in the code. How can I enable batching for inputs to Hugging Face models?
I know this is an old post, but for future readers: the solution to this problem is to pass a Python list of NumPy arrays (instead of a single stacked NumPy array) to the pipeline:
# batched_waveforms here is assumed to be an iterable of torch tensors;
# each one is moved to the CPU and converted to a 1-D NumPy array
model([waveform.cpu().numpy() for waveform in batched_waveforms])
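
Applied to the setup from the question, a minimal sketch (assuming the same demo dataset and model as above) looks like this; a plain list of arrays replaces the np.vstack call, and the pipeline returns one transcription dict per input:

from transformers import pipeline
from datasets import load_dataset

ds = load_dataset(
    "hf-internal-testing/librispeech_asr_demo",
    "clean",
    split="validation")
input_data = ds[0]["audio"]["array"]

model = pipeline(
    task="automatic-speech-recognition",
    model="facebook/s2t-small-librispeech-asr",
    batch_size=5)

# A list of 1-D NumPy arrays: each element is treated as a
# separate utterance, and batch_size controls internal batching.
batch_test = [input_data] * 7
res = model(batch_test)
# res is a list of dicts, one per input, e.g. [{"text": "..."}, ...]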