I am using the following code to send a batch of inputs to the automatic-speech-recognition pipeline:
from transformers import pipeline
from datasets import load_dataset
import numpy as np

ds = load_dataset(
    "hf-internal-testing/librispeech_asr_demo",
    "clean",
    split="validation")
input_data = ds[0]["audio"]["array"]

# Stack seven copies of the same clip into a single 2-D array
batch_test = np.vstack((input_data, input_data))
for i in range(5):
    batch_test = np.vstack((batch_test, input_data))

task = "automatic-speech-recognition"
model_name = "facebook/s2t-small-librispeech-asr"
batch_size = 5
model = pipeline(
    task=task,
    model=model_name,
    batch_size=batch_size)
res = model(batch_test)
res
However, I am receiving the following error, which suggests that the Hugging Face pipeline cannot accept stacked audio inputs and instead treats the rows as multiple channels:
ValueError: We expect a single channel audio input for AutomaticSpeechRecognitionPipeline
Looking at the Hugging Face code, it seems the following line is raising the mentioned error. I couldn't find anything related to preprocessing batched input in the code. How can I enable batching for inputs to Hugging Face models?
I know this is an old post, but for future readers: the solution to this problem is to pass a Python list of NumPy arrays (instead of a single stacked NumPy array) to the pipeline:
# batched_waveforms here is assumed to be an iterable of torch tensors;
# each one is moved to the CPU and converted to a 1-D NumPy array
model([waveform.cpu().numpy() for waveform in batched_waveforms])
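
Applied to the setup from the question, a minimal sketch (assuming the same demo dataset and model as above) looks like this; a plain list of arrays replaces the np.vstack call, and the pipeline returns one transcription dict per input:

from transformers import pipeline
from datasets import load_dataset

ds = load_dataset(
    "hf-internal-testing/librispeech_asr_demo",
    "clean",
    split="validation")
input_data = ds[0]["audio"]["array"]

model = pipeline(
    task="automatic-speech-recognition",
    model="facebook/s2t-small-librispeech-asr",
    batch_size=5)

# A list of 1-D NumPy arrays: each element is treated as a
# separate utterance, and batch_size controls internal batching.
batch_test = [input_data] * 7
res = model(batch_test)
# res is a list of dicts, one per input, e.g. [{"text": "..."}, ...]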