Automatic Speech Recognition - "We expect a single channel audio input" error with both mono and stereo audio

I’m trying to use the AutomaticSpeechRecognitionPipeline so that I can support longer audio files via its chunking (chunk_length_s / stride_length_s). I’m running into problems with audio files that have multiple channels, and the same error is raised whether I load the file as stereo or as mono.
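For context, the pipeline is created in the handler roughly like this (the checkpoint name below is a placeholder; the real code loads our fine-tuned model and stores the pipeline as self.transcriber):

from transformers import pipeline

# Placeholder checkpoint; the actual handler loads our fine-tuned model.
transcriber = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")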

Error:

File "/home/model-server/tmp/models/xxx/handler.py", line 104, in inference
  transcription = self.transcriber(inputs=classifier_inputs, chunk_length_s=10, stride_length_s=(4, 2))
File "/home/venv/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 182, in __call__
  return super().__call__(inputs, **kwargs)
File "/home/venv/lib/python3.8/site-packages/transformers/pipelines/base.py", line 1043, in __call__
  return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "/home/venv/lib/python3.8/site-packages/transformers/pipelines/base.py", line 1064, in run_single
  for model_inputs in self.preprocess(inputs, **preprocess_params):
File "/home/venv/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 249, in preprocess
  raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")
ValueError: We expect a single channel audio input for AutomaticSpeechRecognitionPipeline
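
If it helps, the check that raises appears to live in AutomaticSpeechRecognitionPipeline.preprocess. Here is my paraphrase of it (approximate, from reading transformers/pipelines/automatic_speech_recognition.py, so treat it as a sketch rather than the exact source):

import numpy as np

def check_asr_input(inputs):
    # Approximate paraphrase of the guard in preprocess(): the pipeline wants
    # a plain 1-D numpy waveform, i.e. one channel and no batch axis.
    if not isinstance(inputs, np.ndarray):
        raise ValueError("We expect a numpy ndarray as input")
    if len(inputs.shape) != 1:
        raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")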

Here is the simplified code:

# Decode the uploaded bytes as 16 kHz mono audio
audio, sample_rate = librosa.load(io.BytesIO(file_as_string), mono=True, sr=16_000)
audio_array = [audio]  # batch of one waveform
inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = self.model(inputs[0].input_values, attention_mask=inputs[0].attention_mask).logits

# Hand the processed audio to the chunking pipeline
classifier_inputs = inputs[0].input_values.numpy()
transcription = self.transcriber(inputs=classifier_inputs, chunk_length_s=10, stride_length_s=(4, 2))

I’m not sure what I’m missing.
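
My best guess so far is the batch dimension: the processor returns input_values with shape (batch, samples), so even mono audio reaches the pipeline as a 2-D array. A quick check along these lines is how I’d confirm it (my own debugging sketch, not verified):

# classifier_inputs comes from the processor, which adds a batch axis, so I
# expect shape (1, num_samples) here rather than a 1-D array.
print(classifier_inputs.shape, classifier_inputs.ndim)
# If that's the issue, squeezing the batch axis out might be enough (untested):
# transcription = self.transcriber(inputs=classifier_inputs.squeeze(0), chunk_length_s=10, stride_length_s=(4, 2))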

I’ve also tried the following, which produces the same error:

audio, sample_rate = librosa.load(io.BytesIO(file_as_string), mono=False, sr=16_000)  # load as stereo instead
audio_array = [audio[0]]  # keep only the first channel
inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = self.model(inputs[0].input_values, attention_mask=inputs[0].attention_mask).logits

classifier_inputs = inputs[0].input_values.numpy()
transcription = self.transcriber(inputs=classifier_inputs, chunk_length_s=10, stride_length_s=(4, 2))
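
Or should I be bypassing the processor entirely and handing the raw waveform from librosa straight to the pipeline, letting it do its own feature extraction? Something like this (an untested sketch on my end):

# Untested sketch: librosa.load with mono=True returns a 1-D float32 array,
# which sounds like what the pipeline's preprocess step expects.
audio, sample_rate = librosa.load(io.BytesIO(file_as_string), mono=True, sr=16_000)
transcription = self.transcriber(inputs=audio, chunk_length_s=10, stride_length_s=(4, 2))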