I am trying out the wav2vec2 model for ASR from the huggingface library. I am passing a 7 min (~15 MB) wav file containing an English conversation to the wav2vec2 model, and I get a "can't allocate memory" error. I found that the model uses all 64 GB of the available RAM. Can anyone help with this?
- `transformers` version: 4.3.2
- Platform: Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.3
- PyTorch version (GPU?): 1.7.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: (NA)
- Using distributed or parallel set-up in script?: (NA)
Code
```
import os

import librosa
import nltk
import soundfile as sf
import torch
from pydub import AudioSegment
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer


def convert_audio_segment(fp, upload_dir_path):
    """Convert an uploaded audio file to wav if needed."""
    USER_UPLOAD_DIR = upload_dir_path
    formats_to_convert = ['.m4a']
    dirpath = os.path.abspath(USER_UPLOAD_DIR)
    if fp.endswith(tuple(formats_to_convert)):
        (path, file_extension) = os.path.splitext(fp)
        file_extension_final = file_extension.replace('.', '')
        file_handle = ''
        try:
            track = AudioSegment.from_file(fp, file_extension_final)
            print("track", track)
            wav_path = fp.replace(file_extension_final, 'wav')
            file_handle = track.export(wav_path, format='wav')
        except Exception:
            print("ERROR CONVERTING " + str(fp))
        return file_handle
    else:
        print("No file format conversion required " + str(fp))
        return fp


def load_wav2vec_100h_model():
    tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-100h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h")
    return tokenizer, model


def correct_sentence(input_text):
    sentences = nltk.sent_tokenize(input_text)
    return ' '.join([s.replace(s[0], s[0].capitalize(), 1) for s in sentences])


def asr_transcript(tokenizer, model, input_file):
    speech, fs = sf.read(input_file)
    # Mix down to mono if the file has more than one channel
    if len(speech.shape) > 1:
        speech = speech[:, 0] + speech[:, 1]
    if fs != 16000:
        speech = librosa.resample(speech, fs, 16000)
    # The whole recording is passed to the model in a single forward call
    input_values = tokenizer(speech, return_tensors="pt").input_values
    logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.decode(predicted_ids[0])
    return correct_sentence(transcription.lower())


if __name__ == "__main__":
    data_dir = os.getcwd()  # directory that holds the uploaded recordings
    tokenizer_100h, model_100h = load_wav2vec_100h_model()
    wav_input = 'Recording_biweu.wav'
    fp = wav_input
    processed_file = convert_audio_segment(str(fp), str(data_dir))
    text = asr_transcript(tokenizer_100h, model_100h, processed_file)
    print(text)
```
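Right now the whole 7 min recording goes into a single forward call. One workaround I am considering is to split the audio into shorter chunks and transcribe them one at a time, then join the pieces. Below is a rough sketch of what I mean; the `asr_transcript_chunked` name, the 30 s chunk length, and the plain concatenation of per-chunk transcriptions are my own assumptions, not anything from the wav2vec2 docs:

```
import soundfile as sf
import torch

def asr_transcript_chunked(tokenizer, model, input_file, chunk_s=30):
    """Transcribe a long file in fixed-size chunks (assumed 30 s) to bound memory."""
    speech, fs = sf.read(input_file)
    if len(speech.shape) > 1:
        speech = speech[:, 0] + speech[:, 1]
    # assumes the audio is already 16 kHz, like the file described below; resample first otherwise
    chunk_len = chunk_s * fs
    pieces = []
    with torch.no_grad():  # inference only, no gradients needed
        for start in range(0, len(speech), chunk_len):
            chunk = speech[start:start + chunk_len]
            input_values = tokenizer(chunk, return_tensors="pt").input_values
            logits = model(input_values).logits
            predicted_ids = torch.argmax(logits, dim=-1)
            pieces.append(tokenizer.decode(predicted_ids[0]))
    return ' '.join(pieces)
```

I realize naive chunking can cut a word at a chunk boundary, so this is only meant to show the memory-bounding idea, not a polished solution.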
More details about my wav file:
```
General
Complete name : Recording_biweu.wav
Format : Wave
File size : 13.8 MiB
Duration : 7 min 30 s
Overall bit rate mode : Constant
Overall bit rate : 256 kb/s
Track name : Recording_biweu
Recorded date : 2021
Writing application : Lavf57.83.100
Audio
Format : PCM
Format settings : Little / Signed
Codec ID : 1
Duration : 7 min 30 s
Bit rate mode : Constant
Bit rate : 256 kb/s
Channel(s) : 1 channel
Sampling rate : 16.0 kHz
Bit depth : 16 bits
Stream size : 13.8 MiB (100%)
```
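If I'm computing this right, the file corresponds to roughly 450 s × 16,000 samples/s = 7.2 million raw samples, and the wav2vec2 feature extractor downsamples the waveform by a factor of about 320, so the transformer encoder sees on the order of 22,500 frames for the whole recording:

```
duration_s = 7 * 60 + 30      # 7 min 30 s, from the file info above
sample_rate = 16_000          # 16.0 kHz, from the file info above
samples = duration_s * sample_rate   # 7,200,000 raw samples
frames = samples // 320              # ~22,500 encoder frames (wav2vec2 conv stride ~320)
print(samples, frames)               # 7200000 22500
```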
Error
```
Some weights of the model checkpoint at facebook/wav2vec2-base-100h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.mask_time_emb_vector']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
File "asr_wav2vec2.py", line 130, in <module>
text = asr_transcript(tokenizer_100h,model_100h,processed_file)
File "asr_wav2vec2.py", line 96, in asr_transcript
logits = model(input_values).logits
File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 795, in forward
outputs = self.wav2vec2(
File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 646, in forward
encoder_outputs = self.encoder(
File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 457, in forward
hidden_states, attn_weights = layer(hidden_states, output_attentions=output_attentions)
File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 392, in forward
hidden_states, attn_weights, _ = self.attention(hidden_states, output_attentions=output_attentions)
File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 286, in forward
attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 24373495488 bytes. Error code 12 (Cannot allocate memory)
```
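The failing allocation seems consistent with the self-attention weights growing quadratically with sequence length: with wav2vec2-base's 12 attention heads and float32 (4 bytes per value), the attention matrix of a single layer for ~22,500 frames is about 12 × 4 × 22,500² ≈ 24 GB, which lines up with the 24,373,495,488 bytes in the error (that exact number corresponds to a sequence length of 22,534 frames). A quick sanity check:

```
bytes_reported = 24_373_495_488   # from the traceback above
heads = 12                        # wav2vec2-base attention heads
bytes_per_float = 4               # float32
seq_len = (bytes_reported // (heads * bytes_per_float)) ** 0.5
print(seq_len)                    # 22534.0 -> attention memory is O(seq_len**2)
```

If that reading is right, it is also why I am considering the chunked approach sketched above: a 30 s chunk is only ~1,500 frames, so the per-layer attention matrix would stay around 100 MB instead of 24 GB.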