Getting hidden states from the "automatic-speech-recognition" pipeline

Hi - I’ve been able to transcribe long audio clips using the pipeline class. However, I’d also like to get the hidden states (the outputs of the last layer) at every timepoint of these audio files. I’ve tried setting return_hidden_states=True both when I initialize the pipeline object and when I call it, but this does not affect the output. How else could I retrieve the hidden states for long audio files using the pipeline class?

from transformers import pipeline
import soundfile as sf

filename = 'test.wav'
audio_input, sample_rate = sf.read(filename)

pipe = pipeline(model="facebook/wav2vec2-base-960h", return_hidden_states=True)

out = pipe(audio_input, chunk_length_s=10, stride_length_s=2, return_hidden_states=True, return_timestamps="word")

However, out only contains the transcription chunks, with no hidden states:

 'chunks': [{'text': 'AND', 'timestamp': (0.34, 0.4)},
  {'text': 'THEN', 'timestamp': (0.46, 0.58)},
  {'text': 'NOW', 'timestamp': (1.54, 1.7)},
  {'text': 'WERE', 'timestamp': (1.78, 1.92)},
  {'text': 'RECORDING', 'timestamp': (1.96, 2.32)},
  {'text': 'AN', 'timestamp': (2.38, 2.42)},
  {'text': 'IOU', 'timestamp': (2.52, 2.62)},
  {'text': 'ALL', 'timestamp': (3.1, 3.2)},
  {'text': 'RIGHT', 'timestamp': (3.24, 3.38)},
  {'text': 'SOIXSIDE', 'timestamp': (5.18, 5.72)},
  {'text': 'EGG', 'timestamp': (5.78, 5.86)},
  {'text': 'AM', 'timestamp': (6.88, 7.02)},
  {'text': 'I', 'timestamp': (8.98, 9.0)},
  {'text': 'WAS', 'timestamp': (9.06, 9.12)},
  {'text': 'EVERYBODY', 'timestamp': (9.18, 9.48)}]}
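
For reference, one possible workaround is to bypass the pipeline and run the base model chunk by chunk. The sketch below is only that: chunk_bounds and last_hidden_states are hypothetical helper names, and the overlapping-window scheme is a simple one of my own, not necessarily what the pipeline does internally. It assumes 16 kHz mono audio, and the conversion from sample offsets to frame offsets is approximate, so the stitched result may be off by a frame at chunk boundaries.

```python
def chunk_bounds(n_samples, sample_rate, chunk_length_s=10.0, stride_length_s=2.0):
    """Split n_samples into overlapping windows (hypothetical helper).

    Returns a list of (start, end, left_trim, right_trim) in samples.
    The trims mark the overlap to discard so the kept regions tile the audio.
    """
    chunk_len = int(round(chunk_length_s * sample_rate))
    stride = int(round(stride_length_s * sample_rate))
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must be greater than 2 * stride_length_s")

    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        is_first = start == 0
        is_last = end == n_samples
        # No trimming at the very beginning or very end of the recording.
        bounds.append((start, end, 0 if is_first else stride, 0 if is_last else stride))
        if is_last:
            break
        start += step
    return bounds


def last_hidden_states(audio, sample_rate, model_name="facebook/wav2vec2-base-960h"):
    """Return last-layer hidden states for the whole clip, shape (frames, hidden_size)."""
    # Imported lazily so chunk_bounds can be used without torch/transformers installed.
    import torch
    from transformers import AutoFeatureExtractor, AutoModel

    extractor = AutoFeatureExtractor.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    pieces = []
    for start, end, left, right in chunk_bounds(len(audio), sample_rate):
        inputs = extractor(audio[start:end], sampling_rate=sample_rate, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (frames, hidden_size)
        # Approximate mapping from sample offsets to frame offsets.
        frames_per_sample = hidden.shape[0] / (end - start)
        lo = int(round(left * frames_per_sample))
        hi = hidden.shape[0] - int(round(right * frames_per_sample))
        pieces.append(hidden[lo:hi])
    return torch.cat(pieces, dim=0)
```

With the variables from my snippet above, this would be called as states = last_hidden_states(audio_input, sample_rate), giving roughly one row per 20 ms frame. I would still prefer a way to get this directly from the pipeline, if one exists.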