File size/speech length limit for Wav2Vec2?

Hi there. I’ve been trying out Hugging Face’s implementation of Wav2Vec2 for transcription on Colab Pro, and got pretty good results on short speeches under 80 seconds. Anything beyond that just crashes the notebook, even when I set it to High RAM or compress the audio file drastically.

Is there a practical limit to the length of the audio clip that can/should be run on HF-Wav2Vec2? I tried looking for documentation on this, but might have missed it.

Appreciate any pointers on this.


Answering my own question in case anyone stumbles on this and wants a quick solution: it seems to be a memory issue. I cobbled together a simple, if clumsy, way to split the audio and transcribe the clips one at a time. See the attached screen grab, or check out the notebooks in my repo for this project: GitHub - chuachinhon/wav2vec2_transformers: Transcribing audio files using Hugging Face's implementation of Wav2Vec2 + "chain-linking" NLP tasks to combine speech-to-text with downstream tasks like translation and summarisation.
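For anyone who wants the gist without opening the notebooks: the idea is just to cut the long recording into short clips and feed them to the model one at a time. Here is a minimal sketch of the splitting step, assuming pydub; the filenames and the 60-second chunk length are placeholders of mine, not necessarily what the repo uses:

from pydub import AudioSegment

# Cut a long recording into 60-second clips that fit in memory
audio = AudioSegment.from_file("speech.mp3")  # placeholder filename
chunk_ms = 60 * 1000  # pydub slices by milliseconds
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    audio[start:start + chunk_ms].export(f"clip_{i:03d}.wav", format="wav")

Each clip can then be transcribed separately and the transcripts joined.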


@lysandre has a far better solution to this issue. See this GitHub issue: can't allocate memory error with wav2vec2 · Issue #10366 · huggingface/transformers · GitHub

Code screen grab below
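For readers who can't see the image, I haven't reproduced the exact snippet from the issue here; as a stand-in, below is a minimal chunked-inference sketch that addresses the same memory problem by running the model locally on fixed-size chunks (the 30-second chunk length and the filename are mine):

import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load the audio at the 16 kHz mono rate the model was trained on
speech, _ = librosa.load("speech.mp3", sr=16_000, mono=True)  # placeholder filename

chunk_len = 30 * 16_000  # 30-second chunks; tune to available RAM
pieces = []
for start in range(0, len(speech), chunk_len):
    inputs = processor(speech[start:start + chunk_len],
                       sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    pieces.append(processor.batch_decode(predicted_ids)[0])

print(" ".join(pieces))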


Thanks for sharing, and happy the snippet helped!

Thank you for this topic. I’m adding some of my experience and the error messages I encountered, so that others can find this thread via search.

My code was adapted from What is Automatic Speech Recognition? - Hugging Face

import os
import sys
import json
import requests
from dotenv import load_dotenv, find_dotenv


_ = load_dotenv(find_dotenv())  # read the API token from a local .env file
API_TOKEN = os.getenv('HUGGINGFACE_PLAY_READ')

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/facebook/wav2vec2-base-960h"


def query(filename):
    # POST the raw audio bytes to the Inference API and decode the JSON reply
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))


filename = sys.argv[1]

try:
    data = query(filename)
    print(data["text"])
except Exception:
    # The API returned an error payload (or the request failed);
    # drop into the debugger to inspect what came back
    import pdb
    _, _, tb = sys.exc_info()
    pdb.post_mortem(tb)
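Assuming the script above is saved as asr_query.py (the name is mine), it takes the audio path as its only argument:

python asr_query.py clip.mp3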

When I run it on a 100-second mono mp3 clip, it works fine.
Note that the mp3 is actually sampled at 44,100 Hz, not the 16 kHz the documentation requires, so I would guess the transcription quality might suffer.
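To rule out the sample-rate mismatch, the clip can be resampled to 16 kHz mono before uploading. A minimal sketch with librosa and soundfile (filenames are placeholders):

import librosa
import soundfile as sf

# Downmix to mono and resample from 44.1 kHz to the 16 kHz the model expects
y, _ = librosa.load("clip.mp3", sr=16_000, mono=True)
sf.write("clip_16k.wav", y, 16_000)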

When I use a 200-second clip, I get an error because data is just this:
{'error': 'Service Unavailable'}

When I use an 800-second clip, data == {'error': 'Model facebook/wav2vec2-base-960h is currently loading', 'estimated_time': 20.0}
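Both of those responses look transient, so one workaround is to wrap the query() function above in a retry loop. A sketch under that assumption (the attempt count and fallback wait are numbers I picked):

import time

def query_with_retry(filename, attempts=5, fallback_wait=20.0):
    # Re-send the request while the API reports a transient error,
    # sleeping for the estimated_time it advertises when available
    for _ in range(attempts):
        data = query(filename)
        if "error" not in data:
            return data
        time.sleep(data.get("estimated_time", fallback_wait))
    return data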

Finally, at 1600 seconds, I get a clear response.content value of 'Payload reached size limit.', but still no information on what the size limit is…

I’d love to see the limits of the model and the Inference API documented more clearly, and/or tutorials that provide a more robust ASR function.