Hi there. I’ve been trying out Hugging Face’s implementation of Wav2Vec2 for transcription on Colab Pro, and got pretty good results on short speeches under 80 seconds. Anything beyond that crashes the notebook, even with the High-RAM runtime or with the audio file drastically compressed.
Is there a practical limit to the length of the audio clip that can/should be run on HF-Wav2Vec2? I tried looking for documentation on this, but might have missed it.
import os
import sys
import json
import requests
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
API_TOKEN = os.getenv('HUGGINGFACE_PLAY_READ')
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/facebook/wav2vec2-base-960h"

def query(filename):
    # POST the raw audio bytes to the Inference API and decode the JSON reply
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

filename = sys.argv[1]
try:
    data = query(filename)
    print(data["text"])
except Exception:
    # drop into the debugger at the point of failure
    import pdb
    _, _, tb = sys.exc_info()
    pdb.post_mortem(tb)
When I run it with a 100-second mp3 mono clip, it works fine.
Note that the mp3 is actually sampled at 44100 Hz, not the 16 kHz the documentation requires, so I would guess transcription quality might suffer.
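For what it's worth, resampling to 16 kHz before uploading is straightforward. Here is a minimal sketch using scipy's polyphase resampler on a decoded mono waveform (librosa or torchaudio would work just as well; the synthetic tone below is only a stand-in for a real decoded mp3):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int = 44100) -> np.ndarray:
    """Resample a mono float waveform to the 16 kHz wav2vec2-base-960h expects."""
    g = gcd(16000, orig_sr)  # 16000/44100 reduces to 160/441
    return resample_poly(audio, 16000 // g, orig_sr // g)

# stand-in for decoded audio: one second of a 440 Hz tone at 44.1 kHz
t = np.linspace(0, 1, 44100, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)

resampled = to_16k(tone)
print(len(resampled))  # 16000 samples, i.e. 1 second at 16 kHz
```

You would still need something like pydub or torchaudio to decode the mp3 into a numpy array first.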
When I use a 200-second clip, I get an error because data is just this: {'error': 'Service Unavailable'}
When I use an 800-second clip, data == {'error': 'Model facebook/wav2vec2-base-960h is currently loading', 'estimated_time': 20.0}
Finally, at 1600 seconds, response.content is simply 'Payload reached size limit.', with still no indication of what that limit actually is…
I’d love to see the limits of the model more clearly documented, and/or tutorials that provide a more robust ASR function.
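In the meantime, one common workaround for long clips is to run the model locally and transcribe in fixed-length chunks, stitching the pieces back together. A sketch of just the chunking step (the 30-second window and 1-second overlap are arbitrary choices, and each chunk would then be fed through the model separately):

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000,
                chunk_s: float = 30.0, overlap_s: float = 1.0):
    """Split a long mono waveform into overlapping chunks for piecewise ASR."""
    size = int(chunk_s * sr)               # samples per chunk
    step = int((chunk_s - overlap_s) * sr)  # hop between chunk starts
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + size])
        if start + size >= len(audio):
            break
    return chunks

# 100 seconds of silence at 16 kHz as a stand-in for a real clip
audio = np.zeros(100 * 16000, dtype=np.float32)
chunks = chunk_audio(audio)
print(len(chunks), len(chunks[0]) / 16000)  # 4 chunks, first one 30.0 s long
```

If you run the model via the transformers ASR pipeline instead of the Inference API, its chunk_length_s and stride_length_s arguments do essentially this internally, which sidesteps the payload limit entirely (at the cost of loading the model yourself).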