Hi there. I’ve been trying out Hugging Face’s implementation of Wav2Vec2 for transcription on Colab Pro, and got pretty good results on short speeches under 80 seconds. Anything beyond that crashes the notebook, even with the High-RAM runtime or with the audio file drastically compressed.
Is there a practical limit to the length of the audio clip that can/should be run on HF-Wav2Vec2? I tried looking for documentation on this, but might have missed it.
import os
import sys
import json
import requests
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
API_TOKEN = os.getenv('HUGGINGFACE_PLAY_READ')
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/facebook/wav2vec2-base-960h"

def query(filename):
    # POST the raw audio bytes to the Inference API and decode the JSON reply
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

filename = sys.argv[1]
try:
    data = query(filename)
    print(data["text"])
except Exception:
    # drop into the debugger at the point of failure
    import pdb
    _, _, tb = sys.exc_info()
    pdb.post_mortem(tb)
When I run it with a 100-second mp3 mono clip, it works fine.
Note that the mp3 is actually sampled at 44100 Hz, not the 16 kHz the documentation requires, so I would guess transcription quality might suffer.
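For what it's worth, resampling to 16 kHz before uploading is straightforward. Here is a minimal sketch using scipy's polyphase resampler on a decoded mono waveform (librosa or torchaudio would work just as well; the synthetic tone below is only a stand-in for a real decoded mp3):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int = 44100) -> np.ndarray:
    """Resample a mono float waveform to the 16 kHz wav2vec2-base-960h expects."""
    g = gcd(16000, orig_sr)  # 16000/44100 reduces to 160/441
    return resample_poly(audio, 16000 // g, orig_sr // g)

# stand-in for decoded audio: one second of a 440 Hz tone at 44.1 kHz
t = np.linspace(0, 1, 44100, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)

resampled = to_16k(tone)
print(len(resampled))  # 16000 samples, i.e. 1 second at 16 kHz
```

You would still need something like pydub or torchaudio to decode the mp3 into a numpy array first.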
When I use a 200-second clip, I get an error because data is just this: {'error': 'Service Unavailable'}
When I use an 800-second clip, data == {'error': 'Model facebook/wav2vec2-base-960h is currently loading', 'estimated_time': 20.0}
Finally, at 1600 seconds, response.content is simply 'Payload reached size limit.', with still no indication of what that limit actually is…
I’d love to see the limits of the model more clearly documented, and/or tutorials that provide a more robust ASR function.
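In the meantime, one common workaround for long clips is to run the model locally and transcribe in fixed-length chunks, stitching the pieces back together. A sketch of just the chunking step (the 30-second window and 1-second overlap are arbitrary choices, and each chunk would then be fed through the model separately):

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000,
                chunk_s: float = 30.0, overlap_s: float = 1.0):
    """Split a long mono waveform into overlapping chunks for piecewise ASR."""
    size = int(chunk_s * sr)               # samples per chunk
    step = int((chunk_s - overlap_s) * sr)  # hop between chunk starts
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + size])
        if start + size >= len(audio):
            break
    return chunks

# 100 seconds of silence at 16 kHz as a stand-in for a real clip
audio = np.zeros(100 * 16000, dtype=np.float32)
chunks = chunk_audio(audio)
print(len(chunks), len(chunks[0]) / 16000)  # 4 chunks, first one 30.0 s long
```

If you run the model via the transformers ASR pipeline instead of the Inference API, its chunk_length_s and stride_length_s arguments do essentially this internally, which sidesteps the payload limit entirely (at the cost of loading the model yourself).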