I’m trying to use the AutomaticSpeechRecognitionPipeline so that I can take advantage of its chunking support for longer audio files. I’m running into problems with audio files that have multiple channels, and the error below occurs whether I load the audio as stereo or mono.
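For context, the transcriber is created and called roughly like this (a simplified sketch; the checkpoint name is just a placeholder, since my real handler loads a fine-tuned model):

import numpy as np
from transformers import pipeline

# Placeholder checkpoint; the real handler loads my own fine-tuned Wav2Vec2 checkpoint.
transcriber = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

# Chunked inference over a long file: 10 s windows with (4 s, 2 s) striding.
waveform = np.zeros(16_000 * 30, dtype=np.float32)  # dummy 30 s of audio at 16 kHz
print(transcriber(inputs=waveform, chunk_length_s=10, stride_length_s=(4, 2)))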
Error:
File "/home/model-server/tmp/models/xxx/handler.py", line 104, in inference
transcription = self.transcriber(inputs=classifier_inputs, chunk_length_s=10, stride_length_s=(4, 2))
org.pytorch.serve.wlm.WorkerThread - Backend response time: 2396
File "/home/venv/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 182, in __call__
return super().__call__(inputs, **kwargs)
File "/home/venv/lib/python3.8/site-packages/transformers/pipelines/base.py", line 1043, in __call__
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "/home/venv/lib/python3.8/site-packages/transformers/pipelines/base.py", line 1064, in run_single
for model_inputs in self.preprocess(inputs, **preprocess_params):
File "/home/venv/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 249, in preprocess
raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")
ValueError: We expect a single channel audio input for AutomaticSpeechRecognitionPipeline
Here is the simplified code:
audio, sample_rate = librosa.load(io.BytesIO(file_as_string), mono=True, sr=16_000)  # mono, 16 kHz
audio_array = [audio]  # batch of one waveform
inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = self.model(inputs.input_values, attention_mask=inputs.attention_mask).logits
# input_values comes back with a batch dimension: shape (1, num_samples)
classifier_inputs = inputs.input_values.numpy()
transcription = self.transcriber(inputs=classifier_inputs, chunk_length_s=10, stride_length_s=(4, 2))
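To show the shapes involved, here is a standalone sketch with a stock Wav2Vec2 processor standing in for mine (the checkpoint name is just a placeholder):

import numpy as np
from transformers import Wav2Vec2Processor

# Stand-in processor; my handler loads its own fine-tuned checkpoint.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

audio = np.zeros(16_000 * 30, dtype=np.float32)  # mono waveform straight from librosa
inputs = processor([audio], sampling_rate=16_000, return_tensors="pt", padding=True)

print(audio.shape)                # (480000,), a 1-D array
print(inputs.input_values.shape)  # torch.Size([1, 480000]), 2-D with a batch dimension of one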
I’m not sure what I’m missing.