Whisper pipeline return_timestamps error

When I use the pipeline to get a transcription with timestamps, it works for some audio files, but for others it raises this error:

ValueError                                Traceback (most recent call last)
<ipython-input-16-8cc132230b9b> in <module>
----> 1 prediction = pipe(dataset[0], return_timestamps=True)["chunks"]

4 frames
/usr/local/lib/python3.8/dist-packages/transformers/pipelines/automatic_speech_recognition.py in _find_timestamp_sequence(sequences, tokenizer, feature_extractor, max_source_positions)
    104         sequence = sequence.squeeze(0)
    105         # get rid of the `forced_decoder_idx` that are use to parametrize the generation
--> 106         begin_idx = np.where(sequence == timestamp_begin)[0].item() if timestamp_begin in sequence else 0
    107         sequence = sequence[begin_idx:]

ValueError: can only convert an array of size 1 to a Python scalar
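
From the traceback, my reading is that the expression on line 106 can only work when `timestamp_begin` occurs exactly once in `sequence`, because `.item()` needs a one-element array. Here is a minimal NumPy sketch that produces the same message. The token ids and the helper name `find_begin_idx` are made up for illustration; I have not confirmed that this is what actually happens with my audio.

```python
import numpy as np

# Hypothetical token ids, chosen only for illustration.
timestamp_begin = 50364
# The timestamp_begin token appears twice in this made-up sequence.
sequence = np.array([50258, 50364, 1000, 50364, 2000])


def find_begin_idx(sequence, timestamp_begin):
    # Same expression as line 106 in the traceback above.
    return np.where(sequence == timestamp_begin)[0].item() if timestamp_begin in sequence else 0


try:
    find_begin_idx(sequence, timestamp_begin)
    error_message = None
except ValueError as err:
    # .item() refuses to convert an array with more than one element.
    error_message = str(err)

print(error_message)
```

If that reading is right, the files that fail would be the ones where the generated sequence contains the first timestamp token more than once.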

Below is the code I use to run the pipeline.

import librosa
import torch
from datasets import Dataset
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(

filename = files[71][0]
mypath = '/content/drive/MyDrive/twitch_data/audios/prediction/'
audio, _ = librosa.load(mypath + filename, sr=16000)

my_dict = {"raw": [audio], 'sampling_rate': [16000]}
dataset = Dataset.from_dict(my_dict)
dataset.set_format(type="numpy", columns=["raw",'sampling_rate'])

prediction = pipe(dataset[0], return_timestamps=True)["chunks"]

I’m not sure if this is a bug, or if there’s something wrong with my audio files. Any help is appreciated!