Wav2vec2.0 memory issue

Hi @patrickvonplaten, I am trying to fine-tune XLSR-Wav2Vec2. My dataset is huge: it contains more than 900k audio files. With the full dataset I always get an out-of-memory error, even with a batch size of 2 (GPU = 24 GB). When I take a subset (100 audio files) and fine-tune on it, everything is fine. What could be the problem? Is there an issue related to loading the data into memory?

I think memory usage should not depend on how large the dataset is when the batch size is the same.


Hey @EmreOzkose, do you train locally or in a Google Colab? Also, do you get hard-disk out-of-memory errors or RAM out-of-memory errors?

Feel free to share your fine-tuning script here so that I can take a look

I am training locally on a 24 GB GPU. The error is: RuntimeError: CUDA out of memory. Tried to allocate 562.00 MiB (GPU 1; 23.65 GiB total capacity; 0 bytes already allocated; 540.44 MiB free; 0 bytes reserved in total by PyTorch) (in this case, the batch size is 2 and the dataset is huge).

I also tried the Common Voice script and the problem arises again.

My script is the same as the Turkish one in the "Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers" blog post, except for the paths and the resampling (my data is already sampled at 16 kHz).

In addition, since I use my own data, I load it directly with the load_dataset(path/to/csv) function.

# Finetune Script for Wav2vec2 Hugging Face
# 15 March
# https://huggingface.co/blog/fine-tune-xlsr-wav2vec2
# usage:

import os
os.environ['TRANSFORMERS_CACHE'] = '/path/to/wav2vec2_finetune/cache'
os.environ['PYTORCH_TRANSFORMERS_CACHE'] = '/path/to/wav2vec2_finetune/cache'
os.environ['HF_DATASETS_CACHE'] = '/path/to/wav2vec2_finetune/cache'

import torch
from datasets import load_dataset, load_metric
import random
import pandas as pd
import re
import json
from transformers import Wav2Vec2CTCTokenizer
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Processor
import torchaudio
import librosa
import numpy as np
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
from transformers import Wav2Vec2ForCTC
from transformers import TrainingArguments
from transformers import Trainer

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)  # record the pick so it is not drawn twice

    df = pd.DataFrame(dataset[picks])
    print(df)

def remove_special_characters(batch):
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
    return batch

def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = speech_array[0].numpy()
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = batch["text"]
    return batch

def prepare_dataset(batch):
    # check that all files have the correct sampling rate
    assert (
            len(set(batch["sampling_rate"])) == 1
    ), f"Make sure all inputs have the same sampling rate of {processor.feature_extractor.sampling_rate}."

    batch["input_values"] = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0]).input_values

    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

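        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )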

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

if __name__ == "__main__":
    device = "cuda:0"

    # https://huggingface.co/docs/datasets/loading_datasets.html
    common_voice_train = load_dataset('csv', data_files='/path/to/train.csv', split="train")
    common_voice_test = load_dataset('csv', data_files='/path/to/tr/test.csv', split="train")



    # pre-processing
    chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'

    common_voice_train = common_voice_train.map(remove_special_characters, remove_columns=["sentence"])
    common_voice_test = common_voice_test.map(remove_special_characters, remove_columns=["sentence"])
    print("Done special character mapping")

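    vocab_train = common_voice_train.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True,
                                         remove_columns=common_voice_train.column_names)
    # build the test vocabulary from the test split (the original line reused the train split)
    vocab_test = common_voice_test.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True,
                                       remove_columns=common_voice_test.column_names)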

    vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
    vocab_dict = {v: k for k, v in enumerate(vocab_list)}

    vocab_dict["|"] = vocab_dict[" "]
    del vocab_dict[" "]

    vocab_dict["[UNK]"] = len(vocab_dict)
    vocab_dict["[PAD]"] = len(vocab_dict)
    print("length of vocab: {}".format(len(vocab_dict)))

    with open('vocab.json', 'w', encoding="utf-8") as vocab_file:
        json.dump(vocab_dict, vocab_file)

    tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
    print("VOCAB: {}".format(tokenizer.get_vocab()))

    feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True,
                                                 return_attention_mask=True)
    processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

    common_voice_train = common_voice_train.map(speech_file_to_array_fn, remove_columns=common_voice_train.column_names)
    common_voice_test = common_voice_test.map(speech_file_to_array_fn, remove_columns=common_voice_test.column_names)

    # common_voice_train = common_voice_train.map(resample, num_proc=4)
    # common_voice_test = common_voice_test.map(resample, num_proc=4)

    # check data if it is created correctly
    rand_int = random.randint(0, len(common_voice_train)-1)

    print("Target text:", common_voice_train[rand_int]["target_text"])
    print("Input array shape:", np.asarray(common_voice_train[rand_int]["speech"]).shape)
    print("Sampling rate:", common_voice_train[rand_int]["sampling_rate"])

    common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names,
                                                batch_size=2, num_proc=8, batched=True)
    common_voice_test = common_voice_test.map(prepare_dataset, remove_columns=common_voice_test.column_names,
                                              batch_size=2, num_proc=8, batched=True)

    data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
    wer_metric = load_metric("wer")

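    model = Wav2Vec2ForCTC.from_pretrained(
        ...  # arguments not shown in the original post
    )

    training_args = TrainingArguments(
        ...  # arguments not shown in the original post
    )

    trainer = Trainer(
        ...  # arguments not shown in the original post
    )
    print("Starting training...")
    trainer.train()
    print("training is finished")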

    # pretrained_model_path = "./wav2vec2-large-xlsr-turkish-demo"
    # model = Wav2Vec2ForCTC.from_pretrained(pretrained_model_path).to(device)
    # processor = Wav2Vec2Processor.from_pretrained(pretrained_model_path)

    input_dict = processor(common_voice_test["input_values"][0], return_tensors="pt", padding=True)
    logits = model(input_dict.input_values.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]


    reverse_vocab = {j: i for i, j in vocab_dict.items()}

    print("".join([reverse_vocab[i] for i in common_voice_test["labels"][0]]))

How long are your input samples? E.g. when you print:


what number do you get on average, max, min for your dataset?

It might be that your data samples are very long
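
One way to get those numbers, as a minimal sketch. It assumes each processed example stores its waveform under `input_values` (as in the script above) at 16 kHz; `length_stats` is a hypothetical helper name, not part of any library.

```python
def length_stats(lengths, sampling_rate=16000):
    """Return min, mean and max duration in seconds for a list of sample counts."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no samples given")
    seconds = [n / sampling_rate for n in lengths]
    return {
        "min_s": min(seconds),
        "mean_s": sum(seconds) / len(seconds),
        "max_s": max(seconds),
    }

# Toy example: three clips of 1 s, 2 s and 6 s at 16 kHz.
stats = length_stats([16000, 32000, 96000])
print(stats)
```

With a real dataset the call would look something like `length_stats(len(x["input_values"]) for x in common_voice_train)`.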

Since it takes too long to load the data, it might be more helpful to share a normalized histogram of my dataset.

normalized number of samples list:
['0.1512', '1.0000', '0.8367', '0.8265', '0.6045', '0.2867', '0.1256', '0.0611', '0.0330', '0.0189', '0.0103', '0.0057', '0.0034', '0.0017', '0.0011', '0.0006', '0.0004', '0.0002', '0.0001']

corresponding seconds:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

I have audio files that are longer than 19 seconds, which may be the problem. I think padding is done per batch (the default behavior), so each batch has a different shape, and the first batch may contain a long recording. I will check a subset restricted to clips shorter than 6 seconds.
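
To illustrate why a single long clip matters: with per-batch padding, every sequence in a batch is padded to the longest one, so the input tensor holds batch_size × longest_length values. A rough sketch in plain Python (the helper name `padded_batch_elements` is made up for this illustration):

```python
def padded_batch_elements(lengths_in_batch):
    """Number of values in a batch after padding every sample to the longest one."""
    return len(lengths_in_batch) * max(lengths_in_batch)

sr = 16000
short_batch = [2 * sr, 3 * sr]   # 2 s and 3 s clips
long_batch = [2 * sr, 19 * sr]   # one 19 s clip dominates the batch

print(padded_batch_elements(short_batch))  # 96000
print(padded_batch_elements(long_batch))   # 608000
```

The activations inside the model grow at least in proportion to this number, so the longest clip in a batch, not the dataset size, drives peak GPU memory.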


This code runs a training step on the longest sample in the whole dataset. Maybe it will help you.

import torch
from torch import nn

# Method of a Trainer subclass: one training step on the longest sample.
def preallocate_memory_trick(self, model: nn.Module):
    if self.deepspeed:
        return
    # finding the longest input_values and labels in the dataset
    # generating these randomly would require inferring the dtype of the inputs, so...
    input_values = max(self.train_dataset, key=lambda x: len(x['input_values']))['input_values']
    labels = max(self.train_dataset, key=lambda x: len(x['labels']))['labels']
    inputs = {
        "input_values": torch.Tensor(input_values).repeat(self.args.train_batch_size, 1),
        # CTC targets must be integers, so build the labels as a long tensor
        "labels": torch.tensor(labels, dtype=torch.long).repeat(self.args.train_batch_size, 1)
    }
    self.training_step(model, inputs)


Thanks @gorodecki and @patrickvonplaten. I removed the audio files longer than 6 seconds, and it works now. 🥳


Thanks for the question and the answers.

For those having this issue, you can try the following function to remove samples longer than a given duration (6 seconds by default) from common_voice_test and common_voice_train.
As I had already processed and saved the data, I remove the long samples just before training (it is pretty fast on an i7 with 16 GB of RAM).

def remove_long_common_voicedata(dataset, max_seconds=6):
  # convert the pyarrow table to pandas
  dftest = dataset.to_pandas()
  # find the length of input_values
  dftest['len'] = dftest['input_values'].apply(len)
  # for wav2vec training the audio is already resampled to 16 kHz;
  # remove data that is longer than max_seconds (6 seconds is ideal)
  maxLength = max_seconds * 16000
  dftest = dftest[dftest['len'] < maxLength]
  dftest = dftest.drop(columns='len')
  # convert back to a pyarrow table to use in the trainer
  dataset = dataset.from_pandas(dftest)
  # remove directly, do not wait for gc
  del dftest
  return dataset
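
An alternative sketch that avoids the round trip through pandas: a predicate that could be passed to `Dataset.filter`. It assumes the examples carry `input_values` sampled at 16 kHz, as above; `make_length_filter` is a hypothetical helper, not part of the datasets library.

```python
def make_length_filter(max_seconds=6, sampling_rate=16000):
    """Build a predicate that keeps examples shorter than max_seconds."""
    max_length = max_seconds * sampling_rate

    def keep(example):
        return len(example["input_values"]) < max_length

    return keep

keep = make_length_filter(max_seconds=6)
examples = [
    {"input_values": [0.0] * 16000},        # 1 s clip, kept
    {"input_values": [0.0] * (7 * 16000)},  # 7 s clip, dropped
]
kept = [ex for ex in examples if keep(ex)]
print(len(kept))
```

On a real dataset this would be used as `common_voice_train = common_voice_train.filter(make_length_filter(6))`.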
  • Also, if a training run failed and you restart it after changing something, CUDA may report out of memory. Before defining the model and trainer again, make sure the memory has been freed.

    import gc
    #do below before defining model and trainer if you change batch size etc
    #del trainer
    #del model

  • I also needed to set group_by_length=False in TrainingArguments, as it hogged memory initially, and reduced the batch size to 4 (RTX 2070, 8 GB).
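
The snippet in the first bullet only shows the import. A common pattern, given here as an assumption about what was intended rather than the author's exact code, is to delete the old objects, run the garbage collector, and then empty PyTorch's CUDA cache. `free_gpu_memory` is a hypothetical helper name.

```python
import gc

def free_gpu_memory():
    """Run the garbage collector and, if PyTorch with CUDA is available, empty its cache."""
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        return collected
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected

# Typical use: delete the old objects first, then free the memory.
# del trainer
# del model
freed = free_gpu_memory()
```

This only releases memory that nothing references any more, so the `del` statements (or restarting the Python process) still matter.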


Hi @patrickvonplaten,
I am following your notebook (I think it is the same one as in this post): https://huggingface.co/blog/fine-tune-xlsr-wav2vec2
My machine is a local physical machine with an Intel(R) Core™ i9-9900K CPU @ 3.60GHz, 32 GB of RAM and a GeForce RTX 2060, which I think cannot load the pretrained model wav2vec2-large-xlsr-53 because its VRAM is insufficient. I followed this thread because the error I get looks similar:

trainer.train RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 5.79 GiB total capacity; 4.68 GiB already allocated; 1.44 MiB free; 4.76 GiB reserved in total by PyTorch)

I am using the standard Turkish common_voice dataset, and my code is essentially a copy-paste of the code in your notebook without the text explanations. Everything works until training, which fails because the VRAM is insufficient.

I tried decreasing per_device_train_batch_size in TrainingArguments and adding the following code, suggested in other threads I read before, without success:

import gc

I also tried removing audio files longer than 6 seconds with this code:

from mutagen.mp3 import MP3

def mutagen_length(path):
    try:
        audio = MP3(path)
        length = audio.info.length
        return length
    except Exception:
        return None

def _saca_chantas(datos):
    length = mutagen_length(datos["path"])
    # drop files whose length cannot be read, and files longer than 6 seconds
    return length is not None and length <= 6

common_voice_train = common_voice_train.filter(_saca_chantas)

After replacing the pretrained model with a smaller one, wav2vec2-base, it succeeds!

I want to know the minimum video card requirements for loading the pretrained model wav2vec2-large-xlsr-53, so that I can buy an appropriate one. Could you recommend a suitable NVIDIA card, and/or a way to make my GeForce RTX 2060 do the work without aborting when the VRAM is exhausted?

Thanks for taking the time to read this. Your notebooks are a very helpful resource, and I would be very grateful for a recommendation.


Hey @DanielPezoa,

A 32 GB GPU should be big enough to fine-tune the model actually … do you use config.gradient_checkpointing=True?

Also, it would be interesting to know at what batch_size you are able to fine-tune the model => setting the batch size to 8 in combination with gradient checkpointing should definitely work
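
As a rough sketch of the settings mentioned in this thread, the dictionary below collects memory-related keyword arguments that could be passed to `TrainingArguments`. The values are illustrative assumptions, not tested recommendations, and `memory_saving_args` is a hypothetical helper. Recent versions of transformers accept `gradient_checkpointing` in `TrainingArguments`; older versions, as in the blog post, set it when loading the model.

```python
def memory_saving_args(target_batch_size=8, per_device_batch_size=2):
    """Keyword arguments for TrainingArguments that trade speed for GPU memory."""
    if target_batch_size % per_device_batch_size != 0:
        raise ValueError("target_batch_size must be a multiple of per_device_batch_size")
    return {
        "per_device_train_batch_size": per_device_batch_size,
        # accumulate gradients so the effective batch size stays at target_batch_size
        "gradient_accumulation_steps": target_batch_size // per_device_batch_size,
        "gradient_checkpointing": True,  # recompute activations in the backward pass
        "fp16": True,                    # mixed precision; needs a CUDA GPU
        "group_by_length": False,        # reported earlier in this thread to use a lot of memory
    }

args = memory_saving_args(target_batch_size=8, per_device_batch_size=2)
print(args["gradient_accumulation_steps"])  # 4
```

The dictionary would then be unpacked into the constructor, for example `TrainingArguments(output_dir=..., **args)`.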


I want to train locally. How should I organize the data, and what are train.csv and test.csv? Where can I find them?

Each row contains a 'path' column and a 'sentence' column. It was my own dataset, which is not publicly available.
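
A minimal sketch of that layout, with made-up file paths and transcripts used only for illustration. It writes to an in-memory buffer; for a real file you would open `train.csv` instead.

```python
import csv
import io

# Hypothetical rows: one audio file path and its transcript per row.
rows = [
    {"path": "/path/to/audio_0001.wav", "sentence": "first transcript"},
    {"path": "/path/to/audio_0002.wav", "sentence": "second transcript"},
]

buffer = io.StringIO()  # stands in for open("train.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["path", "sentence"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

A file written this way can be loaded with `load_dataset('csv', data_files=...)` as in the script above, and test.csv follows the same format.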


Hey! I'm trying to fine-tune the W2v2 model on my own dataset (100K audio files) merged with the French CV dataset (89 GB in total), but the preprocessing is killed at 44%. When I remove the longest audio files (keeping those <= 3 s) it works, but what I don't understand is that I have 128 GB of RAM. Shouldn't that be enough?

Thanks for your reply
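
A back-of-the-envelope sketch of why the size on disk can be misleading: compressed audio becomes much larger once it is decoded into arrays. This does not diagnose the problem above; it only shows the order of magnitude. The assumptions are 16 kHz mono audio and 4 or 8 bytes per sample (float32 or float64); the 1000-hour figure is a hypothetical input, and `decoded_audio_gib` is a made-up helper.

```python
def decoded_audio_gib(total_hours, sampling_rate=16000, bytes_per_sample=4):
    """Approximate size in GiB of mono audio once decoded into an array."""
    n_samples = total_hours * 3600 * sampling_rate
    return n_samples * bytes_per_sample / 1024**3

# Hypothetical example: 1000 hours of audio.
as_float32 = decoded_audio_gib(1000)
as_float64 = decoded_audio_gib(1000, bytes_per_sample=8)
print(round(as_float32, 1), round(as_float64, 1))
```

If preprocessing holds a large share of the decoded arrays in memory at once, for instance across several worker processes, the total can exceed the available RAM even when the compressed files fit comfortably.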