Hi - I’ve had success transcribing longer audio clips with the pipeline class. However, I’d also like to get the hidden states (the outputs of the last layer) at every timepoint of these audio files. I’ve tried setting return_hidden_states=True when initializing the pipeline object, but it has no effect on the output. How else could I retrieve the hidden states for long audio files using the pipeline class?
from transformers import pipeline
import soundfile as sf

# Load the raw waveform from disk
filename = 'test.wav'
audio_input, sample_rate = sf.read(filename)

# Transcribe in 10 s chunks with 2 s striding on each side
pipe = pipeline(model="facebook/wav2vec2-base-960h", return_hidden_states=True)
out = pipe(audio_input, chunk_length_s=10, stride_length_s=2, return_hidden_states=True, return_timestamps="word")
The out only contains:
'chunks': [{'text': 'AND', 'timestamp': (0.34, 0.4)},
{'text': 'THEN', 'timestamp': (0.46, 0.58)},
{'text': 'NOW', 'timestamp': (1.54, 1.7)},
{'text': 'WERE', 'timestamp': (1.78, 1.92)},
{'text': 'RECORDING', 'timestamp': (1.96, 2.32)},
{'text': 'AN', 'timestamp': (2.38, 2.42)},
{'text': 'IOU', 'timestamp': (2.52, 2.62)},
{'text': 'ALL', 'timestamp': (3.1, 3.2)},
{'text': 'RIGHT', 'timestamp': (3.24, 3.38)},
{'text': 'SOIXSIDE', 'timestamp': (5.18, 5.72)},
{'text': 'EGG', 'timestamp': (5.78, 5.86)},
{'text': 'AM', 'timestamp': (6.88, 7.02)},
{'text': 'I', 'timestamp': (8.98, 9.0)},
{'text': 'WAS', 'timestamp': (9.06, 9.12)},
{'text': 'EVERYBODY', 'timestamp': (9.18, 9.48)}]}
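For context, on short clips I can already get what I want by skipping the pipeline and calling the base model directly with Wav2Vec2Model, whose last_hidden_state is exactly the per-frame representation I'm after (this sketch uses a dummy 1-second waveform in place of my real audio_input):

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Dummy 1-second, 16 kHz mono waveform standing in for audio_input
audio_input = np.zeros(16000, dtype=np.float32)

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs.input_values)

# (batch, frames, hidden_size) - one 768-dim vector per ~20 ms frame
hidden = outputs.last_hidden_state
print(hidden.shape)
```

But this only works for clips that fit in memory, which is why I was hoping the pipeline's chunking machinery could return the hidden states alongside the transcription.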