When I use the pipeline to get a transcription with timestamps, it works for some audio files, but for others it raises the following error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-8cc132230b9b> in <module>
----> 1 prediction = pipe(dataset[0], return_timestamps=True)["chunks"]

4 frames
/usr/local/lib/python3.8/dist-packages/transformers/pipelines/automatic_speech_recognition.py in _find_timestamp_sequence(sequences, tokenizer, feature_extractor, max_source_positions)
    104         sequence = sequence.squeeze(0)
    105         # get rid of the `forced_decoder_idx` that are use to parametrize the generation
--> 106         begin_idx = np.where(sequence == timestamp_begin)[0].item() if timestamp_begin in sequence else 0
    107         sequence = sequence[begin_idx:]
    108

ValueError: can only convert an array of size 1 to a Python scalar
Below is the code I use to run the pipeline:
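import librosa
import torch
from datasets import Dataset
from transformers import pipeline

# Run on the GPU when one is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    chunk_length_s=30,
    device=device,
)

# `files` is my list of audio files (defined elsewhere)
filename = files[71][0]
mypath = "/content/drive/MyDrive/twitch_data/audios/prediction/"
# Load the audio resampled to 16 kHz
audio, _ = librosa.load(mypath + filename, sr=16000)

# Wrap the waveform in a one-row dataset with the keys the pipeline reads
my_dict = {"raw": [audio], "sampling_rate": [16000]}
dataset = Dataset.from_dict(my_dict)
dataset.set_format(type="numpy", columns=["raw", "sampling_rate"])
prediction = pipe(dataset[0], return_timestamps=True)["chunks"]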
I’m not sure if this is a bug, or if there’s something wrong with my audio files. Any help is appreciated!
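For what it's worth, the expression on line 106 of the traceback can be reproduced in isolation with NumPy. The sketch below is only my guess at the cause: it assumes the generated sequence contains the `timestamp_begin` token more than once, in which case `np.where(...)[0]` has more than one element and `.item()` fails. The token ids and the helper names `find_begin_idx` and `first_begin_idx` are made up for illustration and are not part of transformers.

```python
import numpy as np

# Illustrative token id only; the real value comes from the tokenizer.
TIMESTAMP_BEGIN = 50364


def find_begin_idx(sequence, timestamp_begin):
    """Same expression as line 106 of the traceback."""
    return np.where(sequence == timestamp_begin)[0].item() if timestamp_begin in sequence else 0


def first_begin_idx(sequence, timestamp_begin):
    """Hypothetical tolerant variant: take the first match instead of requiring exactly one."""
    matches = np.flatnonzero(sequence == timestamp_begin)
    return int(matches[0]) if matches.size else 0


# Made-up sequences: one with a single timestamp-begin token, one with two.
single = np.array([50258, 50259, 50359, 50364, 1000, 50464])
repeated = np.array([50258, 50259, 50359, 50364, 1000, 50464, 50364, 2000, 50500])

print(find_begin_idx(single, TIMESTAMP_BEGIN))  # exactly one match, so .item() works

try:
    find_begin_idx(repeated, TIMESTAMP_BEGIN)  # two matches, so .item() raises
except ValueError as err:
    print("ValueError:", err)

print(first_begin_idx(repeated, TIMESTAMP_BEGIN))  # index of the first match
```

If that assumption is right, the failure would depend on what the model generates for a given file rather than on the audio file being corrupt, but I have not confirmed that this is what happens inside the pipeline.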