Spanish ASR: Fine-Tuning Wav2Vec2


I’m planning to fine-tune XLSR-Wav2Vec2 for Spanish, following the notebooks and methodology shared by @patrickvonplaten. See for example the Turkish one.

I’m just getting started and still getting a feel for the data I downloaded from Common Voice. I’ve seen sentences referring to words in other languages, for example:

"hay pues dos pronunciaciones posibles para 日本 nihon o nippon"
"más adelante los kanjis cambiaron a 蝦夷"

This increases the vocab substantially (230 characters after basic punctuation removal), and probably adds a lot of noise. Is this a scenario that occurs in other Common Voice datasets?

I’m currently considering the exclusion of these samples, but I haven’t yet measured how many of them there are in the dataset.


Hey @pcuenq,

Very good question, and it’s great that you already took a look at the dataset. I think you have the following three options here:

  1. Remove all characters which clearly don’t belong to the Spanish language from both the training and the test data.

  2. Don’t remove those characters from the training and test dataset, but remove them from the vocabulary. In this scenario, make sure to add the "[UNK]" token to the vocab and define it as unk_token="[UNK]" when instantiating the tokenizer. This way the model will simply learn to classify all such tokens as [UNK], and you don’t have to significantly change the training data.

  3. Just add all such tokens to the vocab.

Obviously, all methods have their advantages and disadvantages. I would lean toward either option 2) or 1), with option 2) probably being my preferred one.

The reason is that removing all those characters might change the meaning of the sentence so that the remaining part of the sentence doesn’t make much sense anymore. Teaching the model to simply map unknown sounds (symbols of another language) to an unknown symbol (the "[UNK]" token representing the symbol of the other language) makes the most sense here IMO. Also, it shouldn’t really affect your final WER, as the model would most likely not have classified those symbols correctly anyway.

Option 1) is also very much possible by just fully removing those data samples. This should be feasible for high-resource languages like Spanish.

Option 3) has the big disadvantage of making training slower and less stable. This option really only makes sense if each of the added characters occurs a significant number of times in your training data, which I believe is not the case here.
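For what it’s worth, option 2) could be sketched roughly like this. It is only an illustration: the character whitelist ALLOWED_CHARS and the helpers build_vocab and encode are hypothetical names, and the exact set of characters to keep is an assumption you would adapt to your own normalisation rules.

```python
# Hypothetical whitelist of characters to keep for Spanish (an assumption;
# adjust it to your own text normalisation).
ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyzáéíóúüñ ")


def build_vocab(sentences, allowed=ALLOWED_CHARS):
    """Map each allowed character seen in the data to an id, then add [UNK] and [PAD]."""
    chars = sorted({ch for s in sentences for ch in s.lower() if ch in allowed})
    vocab = {ch: i for i, ch in enumerate(chars)}
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into ids; anything outside the vocab becomes [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab.get(ch, unk) for ch in sentence.lower()]


sentences = [
    "hay pues dos pronunciaciones posibles para 日本 nihon o nippon",
    "más adelante los kanjis cambiaron a 蝦夷",
]
vocab = build_vocab(sentences)
ids = encode(sentences[1], vocab)
unk_count = ids.count(vocab["[UNK]"])
print(len(vocab), unk_count)
```

The training text itself stays untouched; only the vocabulary shrinks. The resulting dictionary is what you would save as the vocab file and load into the tokenizer with unk_token="[UNK]".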


Summary update of progress (tl;dr: not much) so far.

  • I took a look at the foreign characters mentioned in the previous message and there aren’t many. Listening to the samples, some of them are omitted in the speech and some are pronounced incorrectly (for example, the Japanese character “の” is pronounced “sigma” by one speaker). I did a couple of translations that I thought could make sense and assigned the rest to [UNK].

  • After data exploration, I consolidated all dataset preparation tasks into a single map function, to prevent disk usage explosion (due to caches and temporary files) and reduce computation time. I also disabled caching and explicitly saved the pre-processed dataset at the points I’m interested in.

  • After I set up the model and invoke train, it takes a long time for training to actually start. I’m talking hours when using the complete Common Voice dataset (training and validation samples together). I tried with and without a dataloader_num_workers training argument. Even with dataloader_num_workers set to 16, I see one process doing something while the rest of the CPUs are idle. I don’t know what it’s doing; I’ll try to investigate.

  • Training on a local GPU with 24 GB of memory (a 3090), I get CUDA out-of-memory errors after a few steps when using a batch size of 32. I suppose some of the samples are longer and I got unlucky in one batch. I could also omit longer samples; does anyone have a feel for a reasonable maximum duration?

  • Iteration and hill-climbing are going to be very slow because of the reasons above. I’m currently training on a subset with just 10% of the data using a batch size of 24, to see if that works. At just 3% progress I get ~1.30s/it. It feels slow, but I don’t really know how that compares. If that works, I plan to train a few epochs per 10% subset.

  • The OVH environment looks awesome (thanks a lot!), but the ephemeral disk space is not big enough to process the Spanish dataset. I’m thinking about mounting an additional block storage unit and uploading my pre-processed dataset there somehow. Not sure how that works, I’ll take a look later.
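On the idea of omitting longer samples, a minimal sketch of a duration filter could look like this. The 10-second cut-off is a hypothetical value rather than a recommendation, and the 16 kHz sampling rate is assumed; is_short_enough is an illustrative name.

```python
SAMPLING_RATE = 16_000  # Hz; assumed sampling rate of the preprocessed audio
MAX_SECONDS = 10.0      # hypothetical cut-off, tune it against your GPU memory


def is_short_enough(example, max_seconds=MAX_SECONDS, sampling_rate=SAMPLING_RATE):
    """Keep only samples whose raw waveform is at most max_seconds long."""
    return len(example["input_values"]) <= max_seconds * sampling_rate


# Toy examples: one 3-second and one 12-second waveform of silence.
examples = [
    {"input_values": [0.0] * (SAMPLING_RATE * 3)},
    {"input_values": [0.0] * (SAMPLING_RATE * 12)},
]
kept = [ex for ex in examples if is_short_enough(ex)]
print(len(kept))
```

With the datasets library the same predicate could be passed to the dataset’s filter method instead of the list comprehension.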

In summary, processing a language with a relatively high amount of training data is harder than I expected. I’m most worried about the delay before training starts; I might have something misconfigured on my computer. Any hints about that, or strategies for dealing with huge datasets, would be appreciated.

But it’s fun and a great learning experience :slight_smile:


Not working on Spanish but I think I can share my experience here.

I suppose some of the samples are longer and I got unlucky in one batch.

I tried the trick here, but it didn’t help. I found out that the main cause of the OOM error is evaluation: the Trainer class keeps all prediction outputs on the GPU for faster evaluation. You can set eval_accumulation_steps to a small value (60 for me with batch size 8 on a 3090), but then it moves tensors from the GPU to the CPU and it’s slow as hell. In the end I settled on using 10% of the test set during training and doing the full evaluation at the end; it still adds 1-2 hours of overhead.
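Picking the 10% evaluation subset mentioned above could be sketched as follows. The helper name sample_indices, the seed and the dataset size are all made up for illustration.

```python
import random


def sample_indices(num_rows, fraction=0.1, seed=42):
    """Pick a reproducible random subset of row indices (at least one row)."""
    rng = random.Random(seed)
    k = max(1, round(num_rows * fraction))
    return sorted(rng.sample(range(num_rows), k))


# Hypothetical test set with 15,000 rows.
eval_indices = sample_indices(15_000)
print(len(eval_indices))
```

With a datasets.Dataset, the indices can be passed to its select method to build the smaller evaluation set, keeping the full test set for the final evaluation.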

After I set up the model and invoke train, it takes a long time for training to actually start. I’m talking hours when using the complete common voice dataset (training and validation samples together).

I’m having the same issue here. In my understanding, data collating is very CPU-intensive, so I tried increasing the number of workers, but it’s not helping. I hope someone else has answers.

Iteration and hill-climbing are going to be very slow because of the reasons above.

Maybe you can disable evaluation completely, that should speed things up :-)
The default settings from the demo notebook should get you somewhere decent. The WER metric is slow as well; someone should profile it.


Thanks a lot for the hints! I’m also using 10% splits of the dataset for both training and testing, and evaluating on the full test set afterwards. Still, I’m having problems evaluating WER on the test dataset after a training run. I wrote a small helper function to compute WER from chunks, on the CPU:

import jiwer

def chunked_wer(targets, predictions, chunk_size=None):
    # Without a chunk size, let jiwer process everything in one call
    if chunk_size is None:
        return jiwer.wer(targets, predictions)
    start = 0
    end = chunk_size
    # Running totals of hits, substitutions, deletions and insertions
    H, S, D, I = 0, 0, 0, 0
    while start < len(targets):
        chunk_metrics = jiwer.compute_measures(targets[start:end], predictions[start:end])
        H = H + chunk_metrics["hits"]
        S = S + chunk_metrics["substitutions"]
        D = D + chunk_metrics["deletions"]
        I = I + chunk_metrics["insertions"]
        start += chunk_size
        end += chunk_size
    # WER = errors / reference words, where reference words = H + S + D
    return float(S + D + I) / float(H + S + D)

# Your targets and predictions
target = result["target_text"]
preds = result["pred_strings"]

print("Total (chunk=5000), WER: {:.2f}".format(100 * chunked_wer(target, preds, chunk_size=5000)))
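The helper above sums hits, substitutions, deletions and insertions across chunks instead of averaging per-chunk WER values. The dependency-free sketch below illustrates why that matters. It uses a plain word-level edit distance rather than jiwer, and word_errors and corpus_wer are hypothetical names.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(targets, predictions):
    """Sum errors and reference words over all pairs, then divide once."""
    errors = words = 0
    for target, prediction in zip(targets, predictions):
        e, n = word_errors(target, prediction)
        errors += e
        words += n
    return errors / words


targets = ["hola mundo", "uno dos tres cuatro cinco seis"]
predictions = ["hola", "uno dos tres cuatro cinco seis"]

# Pooling counts over the whole corpus: 1 error over 8 reference words.
full = corpus_wer(targets, predictions)
# Averaging the WER of each one-sentence chunk weights short sentences too heavily.
mean_of_chunks = (
    corpus_wer(targets[:1], predictions[:1]) + corpus_wer(targets[1:], predictions[1:])
) / 2
print(full, mean_of_chunks)
```

Pooling the counts gives the same number regardless of how the data is chunked, which is what makes the chunked evaluation equivalent to a single pass.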

Regarding this problem, we had an impromptu debugging session in Slack with @PereLluis13 and @maxidl, and realized that the delay is caused by grouping samples by length: the dataset is accessed sequentially in that case.

As a temporary solution, I implemented one of the ideas we discussed with @adilism: precompute the lengths of the samples and subclass Trainer to use them, if available:

# Pre-compute sample lengths
def input_lengths(example):
    example["length"] = len(example["input_values"])
    return example

# Adjust num_proc for your system
common_voice_train = common_voice_train.map(input_lengths, num_proc=num_proc)

## Use subclassed Trainer class to support pre-computed lengths

import collections.abc
from typing import Optional

import torch
from transformers import Trainer
from transformers.trainer_pt_utils import LengthGroupedSampler, DistributedLengthGroupedSampler

class GroupedLengthsTrainer(Trainer):
    # length_field_name should possibly be part of TrainingArguments instead
    def __init__(self, length_field_name=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.length_field_name = length_field_name

    def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
        if isinstance(self.train_dataset, torch.utils.data.IterableDataset) or not isinstance(
            self.train_dataset, collections.abc.Sized
        ):
            return None

        # Build the sampler.
        if self.args.group_by_length:
            # Use the pre-computed lengths column, if a field name was given
            lengths = self.train_dataset[self.length_field_name] if self.length_field_name is not None else None
            model_input_name = self.tokenizer.model_input_names[0] if self.tokenizer is not None else None
            if self.args.world_size <= 1:
                return LengthGroupedSampler(
                    self.train_dataset, self.args.train_batch_size, lengths=lengths, model_input_name=model_input_name
                )
            else:
                # The arguments of this call were lost in the post; these mirror
                # the stock Trainer, plus lengths (an assumption).
                return DistributedLengthGroupedSampler(
                    self.train_dataset,
                    self.args.train_batch_size,
                    num_replicas=self.args.world_size,
                    rank=self.args.process_index,
                    lengths=lengths,
                    model_input_name=model_input_name,
                )
        else:
            return super()._get_train_sampler()

# Build trainer indicating the name of the field that contains the lengths
trainer = GroupedLengthsTrainer(
    length_field_name="length",
    # ... followed by the usual Trainer arguments (model, args, datasets, etc.)
)

This is ugly because we are overriding a private method. In addition, a better place to indicate the field name to use for sorting would possibly be TrainingArguments, but then we’d have to subclass or wrap that one too. But it can get the job done until the issue referenced above is discussed and resolved.
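To illustrate why pre-computed lengths help, here is a simplified, dependency-free sketch of length-grouped ordering. It is not the library’s LengthGroupedSampler, just an approximation of the idea: given a plain list of lengths, the ordering is computed without reading a single audio sample. The function name, the seed and the mega-batch multiplier are assumptions.

```python
import random


def length_grouped_indices(lengths, batch_size, mega_batch_mult=50, seed=0):
    """Shuffle indices, cut them into mega-batches, sort each one longest-first."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)
    mega_size = batch_size * mega_batch_mult
    grouped = []
    for start in range(0, len(indices), mega_size):
        mega = indices[start:start + mega_size]
        grouped.extend(sorted(mega, key=lambda i: lengths[i], reverse=True))
    return grouped


# Toy lengths standing in for the pre-computed "length" column.
length_rng = random.Random(1)
lengths = [length_rng.randint(16_000, 160_000) for _ in range(1_000)]
order = length_grouped_indices(lengths, batch_size=8, mega_batch_mult=5)
print(order[:8])
```

Consecutive batches end up with samples of similar length while the shuffle keeps some randomness between mega-batches.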


Good debugging and nice solution! I believe we can add a check for a lengths column in the dataset like you did to speed this up.


Another advantage of having the length column is that you can sort your validation set by it, which will make evaluation much faster by minimising padding.
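A toy calculation of the padding effect, assuming each batch is padded to its longest sample. The lengths and the helper padded_size are made up for illustration.

```python
def padded_size(lengths, batch_size):
    """Total values processed when every batch is padded to its longest sample."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


lengths = [30, 200, 45, 180, 50, 190, 40, 210]  # toy sample lengths
unsorted_cost = padded_size(lengths, batch_size=4)
sorted_cost = padded_size(sorted(lengths), batch_size=4)
print(unsorted_cost, sorted_cost, sum(lengths))
```

In this example, sorting cuts the padded total from 1640 to 1040 values, against 945 values of actual data.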


Would you supply the column name through TrainingArguments or just hardcode it? Either way, let me know if you’d like a PR. I’m new to this code base, but this looks easy enough :slight_smile:


I think having a new TrainingArguments field called length_field_name that defaults to "length" would be best. If you want to make a PR, by all means go ahead! The only thing to add in your _get_train_sampler is a test of whether the dataset comes from the datasets library (like is_datasets_available() and isinstance(self.train_dataset, datasets.Dataset)).
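The suggested check could be sketched as a small helper like the one below. precomputed_lengths is a hypothetical name, and importlib’s find_spec is used here as a stand-in for is_datasets_available() so that the sketch does not depend on transformers.

```python
import importlib.util


def precomputed_lengths(dataset, length_field_name="length"):
    """Return the lengths column if `dataset` is a datasets.Dataset that has one, else None."""
    # Stand-in for transformers' is_datasets_available() (an assumption made
    # so that this sketch runs without transformers installed).
    if importlib.util.find_spec("datasets") is None:
        return None
    import datasets

    if not isinstance(dataset, datasets.Dataset):
        return None
    if length_field_name not in dataset.column_names:
        return None
    return list(dataset[length_field_name])


# A plain list is not a datasets.Dataset, so the sampler would fall back
# to computing the lengths itself.
print(precomputed_lengths([1, 2, 3]))
```

Returning None for anything else lets the sampler keep its current behaviour for other dataset types.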