[Open-to-the-community] Whisper fine-tuning event

reach-vb · November 25, 2022, 12:27pm

Hey hey!

We are on a mission to democratise speech, increase the language coverage of current SoTA speech recognition and push the limits of what is possible. Come join us from December 5th - 19th for a community sprint powered by Lambda. Through this sprint, we’ll cover 70+ languages, 39M - 1550M parameters & evaluate our models on real-world evaluation datasets.

Register your interest via the Google form here.

What is the sprint about

The goal of the sprint is to fine-tune Whisper in as many languages as possible and make them accessible to the community. We hope that especially low-resource languages will profit from this event.

The main components of the sprint consist of:

Open AI’s state-of-the-art Whisper model
Public datasets like Common Voice 11, VoxPopuli, CoVoST2 and more
Real-world audio for evaluation

How does it work

Participants have two weeks to fine-tune Whisper checkpoints in as many languages as they want. The end goal is to build robust language-specific models that generalise well with real-world data. In general, the model repository on the Hugging Face hub should consist of:

Fine-tuned Whisper checkpoint (e.g. Whisper-large)
Evaluation script for your fine-tuned checkpoint
Hugging Face space to demo your fine-tuned model

The best part is that we’ll provide fine-tuning, evaluation and demo scripts for you to focus on the model performance.

During the event, you will have the opportunity to work on each of these components to build speech recognition systems in your favourite language!

Each Whisper checkpoint will automatically be evaluated on real-world audio (if available for the language). After the fine-tuning week, the best-performing systems of each language will receive SWAG.

What do I need to do to participate

To participate, simply fill out this short Google form. You will also need to create a Hugging Face Hub account here and join our Discord here. Make sure to head over to #role-assignment and click on ML for Audio and Speech.

This fine-tuning week should be especially interesting to native speakers of low-resource languages. Your language skills will help you select the best training data, and possibly build the best existing speech recognition system in your language.

More details will be announced in the discord channel. We are looking forward to seeing you there!

What do I get

learn how to fine-tune state-of-the-art Whisper speech recognition checkpoints
free compute to build a powerful fine-tuned model under your name on the Hub
Hugging Face SWAG if you manage to build the best-performing model in a language
more GPU hours if you manage to have the best-performing model in a language

Open-sourcely yours,

Sanchit, VB & The HF Speech Team

raulkite · November 25, 2022, 3:32pm

I’m using it in Spanish and it works really well, so maybe I don’t need fine-tuning… but

Is it possible to get help improving Whisper’s timestamps, and to work in that area?

Thanks

josearangos · November 25, 2022, 11:08pm

@raulkite, could you please share the code showing how to use Whisper with Spanish audio?

raulkite · November 26, 2022, 5:27am

No difference between English or Spanish. If your audio is Spanish your transcription will be Spanish.

raulkite · November 26, 2022, 6:21am

Hi,

I need good per-word timestamp accuracy with Whisper’s transcriptions.

I have seen that fine-tuning Whisper with Hugging Face seems easy for other languages, so I thought that getting better timestamp accuracy might be feasible this way.

It could be “easy” to create a dataset of aligned long audio with tools like Gentle (GitHub - lowerquality/gentle: gentle forced aligner).
I have experience with this.

Adding some layers on top of the model to train this new output also seems possible.

Is anyone working on this? Am I wrong?

If someone is working on this please ping me.

Thanks

psk · December 2, 2022, 6:19pm

I would rather improve the translation from my language (Telugu) to English. Can we fine-tune the translation task at this point? If yes, where do we specify it? Thanks.

pierreguillou · December 5, 2022, 12:39pm

Hi,

I tried to fine-tune Whisper (all model sizes) with the event python script (run_speech_recognition_seq2seq_streaming.py) and the following code from the event page on Lambda GPU (and Google Colab with the Whisper tiny model) but it failed because of 2 errors.

I found a solution for the first one but not for the second one.

echo 'python run_speech_recognition_seq2seq_streaming.py \
	--model_name_or_path="openai/whisper-small" \
	--dataset_name="mozilla-foundation/common_voice_11_0" \
	--dataset_config_name="es" \
	--language="spanish" \
	--train_split_name="train+validation" \
	--eval_split_name="test" \
	--model_index_name="Whisper Small Spanish" \
	--max_steps="5000" \
	--output_dir="./" \
	--per_device_train_batch_size="64" \
	--per_device_eval_batch_size="32" \
	--logging_steps="25" \
	--learning_rate="1e-5" \
	--warmup_steps="500" \
	--evaluation_strategy="steps" \
	--eval_steps="1000" \
	--save_strategy="steps" \
	--save_steps="1000" \
	--generation_max_length="225" \
	--length_column_name="input_length" \
	--max_duration_in_seconds="30" \
	--text_column_name="sentence" \
	--freeze_feature_encoder="False" \
	--report_to="tensorboard" \
	--gradient_checkpointing \
	--fp16 \
	--overwrite_output_dir \
	--do_train \
	--do_eval \
	--predict_with_generate \
	--do_normalize_eval \
	--use_auth_token \
	--push_to_hub' >> run.sh

First error (during training)

"use_cache=True" is incompatible with gradient checkpointing. Setting "use_cache=False"...

The correction must be made at line 393 of the script. The new config is then:

config = AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
    # disable the cache only when gradient checkpointing is enabled
    use_cache=False if training_args.gradient_checkpointing else True,
)
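
The intended logic can be sketched on its own, outside the script: the cache is switched off only when gradient checkpointing is on. The helper name `resolve_use_cache` is hypothetical; it is not part of the event script or of transformers.

```python
def resolve_use_cache(gradient_checkpointing: bool) -> bool:
    """Return the use_cache value to hand to the model config.

    Hypothetical helper: the cache is disabled exactly when gradient
    checkpointing is enabled, since the two are reported as incompatible.
    """
    return not gradient_checkpointing


print(resolve_use_cache(True))
print(resolve_use_cache(False))
```

In the script, the returned value would take the place of the conditional expression passed as `use_cache`.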

Second error (during evaluation)

Traceback (most recent call last):
  File "run_speech_recognition_seq2seq_streaming.py", line 607, in <module>
    main()
  File "run_speech_recognition_seq2seq_streaming.py", line 556, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/mwpt/lib/python3.8/site-packages/transformers/trainer.py", line 1527, in train
    return inner_training_loop(
  File "/home/ubuntu/mwpt/lib/python3.8/site-packages/transformers/trainer.py", line 1852, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/ubuntu/mwpt/lib/python3.8/site-packages/transformers/trainer.py", line 2115, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/ubuntu/mwpt/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 78, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/home/ubuntu/mwpt/lib/python3.8/site-packages/transformers/trainer.py", line 2811, in evaluate
    output = eval_loop(
  File "/home/ubuntu/mwpt/lib/python3.8/site-packages/transformers/trainer.py", line 3096, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "run_speech_recognition_seq2seq_streaming.py", line 509, in compute_metrics
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
  File "/home/ubuntu/mwpt/lib/python3.8/site-packages/evaluate/module.py", line 444, in compute
    output = self._compute(**inputs, **compute_kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--wer/85bee9e4216a78bb09b2d0d500f6af5c23da58f9210e661add540f5df6630fcd/wer.py", line 103, in _compute
    measures = compute_measures(reference, prediction)
  File "/home/ubuntu/mwpt/lib/python3.8/site-packages/jiwer/measures.py", line 179, in compute_measures
    raise ValueError("one or more groundtruths are empty strings")
ValueError: one or more groundtruths are empty strings

How to solve this issue?

pierreguillou · December 6, 2022, 11:30am

About "use_cache=True" is incompatible with gradient checkpointing. Setting "use_cache=False"...: @sanchit-gandhi answered here.

I hope he can also answer the second error (see my post), which occurs when the script (run_speech_recognition_seq2seq_streaming.py) launches evaluation.

steja · December 6, 2022, 1:45pm

When running Trainer.train()
Has anyone come across this before?

    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`

(next(iter(train_dataset_loader))['input_features'].shape)

gives torch.Size([64, 80, 3000]).
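
For reference, this shape matches what one would expect for a batch of 64 Whisper inputs, so the input features themselves look well-formed. The arithmetic below is a sketch assuming the standard Whisper feature-extractor settings (16 kHz audio, 10 ms hop, 80 mel bins, 30 s windows); the constant and function names are illustrative only.

```python
SAMPLING_RATE = 16_000  # Hz, assumed input sampling rate
HOP_LENGTH = 160        # samples between frames (10 ms at 16 kHz)
N_MELS = 80             # number of mel-frequency bins
CHUNK_SECONDS = 30      # audio is padded or truncated to 30 s


def expected_input_shape(batch_size):
    """Return (batch, mel bins, frames) for one batch of log-mel features."""
    frames = CHUNK_SECONDS * SAMPLING_RATE // HOP_LENGTH
    return (batch_size, N_MELS, frames)


print(expected_input_shape(64))  # (64, 80, 3000)
```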

sanchit-gandhi · December 6, 2022, 4:29pm

Hey @pierreguillou! This error we’re getting means that one of our label strings (references or ground truths) is empty. You can reproduce this error using the following code snippet. Here, the second sample has an empty ground truth:

from evaluate import load

label_str = ["the cat", "", "sat on"]
pred_str = ["the dog", "sat", "sit on"]

wer = load("wer")
print(wer.compute(references=label_str, predictions=pred_str))

ValueError: one or more groundtruths are empty strings

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [1], in <cell line: 7>()
      4 pred_str = ["the dog", "sat", "sit on"]
      6 wer = load("wer")
----> 7 print(wer.compute(references=label_str, predictions=pred_str))

File ~/venv/lib/python3.8/site-packages/evaluate/module.py:444, in EvaluationModule.compute(self, predictions, references, **kwargs)
    442 inputs = {input_name: self.data[input_name] for input_name in self._feature_names()}
    443 with temp_seed(self.seed):
--> 444     output = self._compute(**inputs, **compute_kwargs)
    446 if self.buf_writer is not None:
    447     self.buf_writer = None

File ~/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--wer/85bee9e4216a78bb09b2d0d500f6af5c23da58f9210e661add540f5df6630fcd/wer.py:103, in WER._compute(self, predictions, references, concatenate_texts)
    101 total = 0
    102 for prediction, reference in zip(predictions, references):
--> 103     measures = compute_measures(reference, prediction)
    104     incorrect += measures["substitutions"] + measures["deletions"] + measures["insertions"]
    105     total += measures["substitutions"] + measures["deletions"] + measures["hits"]

File ~/venv/lib/python3.8/site-packages/jiwer/measures.py:206, in compute_measures(truth, hypothesis, truth_transform, hypothesis_transform, **kwargs)
    204     hypothesis = [hypothesis]
    205 if any(len(t) == 0 for t in truth):
--> 206     raise ValueError("one or more groundtruths are empty strings")
    208 # Preprocess truth and hypothesis
    209 truth, hypothesis = _preprocess(
    210     truth, hypothesis, truth_transform, hypothesis_transform
    211 )

ValueError: one or more groundtruths are empty strings

Why is this a problem? Because the WER normalises by the number of reference words (WER - a Hugging Face Space by evaluate-metric). If this is zero, we risk a divide by zero error.

We can add a quick filtering step to only evaluate the samples that correspond to non-zero references:

pred_str = [pred_str[i] for i in range(len(pred_str)) if len(label_str[i]) > 0]
label_str = [label_str[i] for i in range(len(label_str)) if len(label_str[i]) > 0]

print(wer.compute(references=label_str, predictions=pred_str))

Print Output:

0.5

You can add these two lines in your compute_metrics function in the Python training script. I’ll add these to main as well now.
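
As a dependency-free illustration of the same idea, here is a minimal sketch that filters out empty references and then computes WER by hand. The helpers `drop_empty_references`, `word_errors` and `simple_wer` are hypothetical stand-ins written for this example; they are not the `evaluate` or `jiwer` implementations.

```python
from typing import List, Tuple


def drop_empty_references(preds: List[str], refs: List[str]) -> Tuple[List[str], List[str]]:
    """Keep only the (prediction, reference) pairs whose reference is non-empty."""
    kept = [(p, r) for p, r in zip(preds, refs) if len(r) > 0]
    return [p for p, _ in kept], [r for _, r in kept]


def word_errors(ref: str, hyp: str) -> int:
    """Word-level edit distance (substitutions + deletions + insertions)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        curr = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[len(h)]


def simple_wer(preds: List[str], refs: List[str]) -> float:
    """Total word errors divided by the total number of reference words."""
    errors = sum(word_errors(r, p) for p, r in zip(preds, refs))
    total = sum(len(r.split()) for r in refs)
    return errors / total


label_str = ["the cat", "", "sat on"]
pred_str = ["the dog", "sat", "sit on"]

pred_str, label_str = drop_empty_references(pred_str, label_str)
print(simple_wer(pred_str, label_str))  # 0.5
```

After filtering, two of the four remaining reference words are wrong ("cat" vs "dog", "sat" vs "sit"), which gives 2 / 4 = 0.5. If references can end up whitespace-only after normalisation, checking `r.strip()` instead of `len(r) > 0` may be the safer condition.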

Hope that helps!

sanchit-gandhi · December 6, 2022, 4:39pm

Hey @steja! Ooof this is a new one! I’m confident we’ll be able to get to the bottom of it!

Could you run the following:

transformers-cli env

And paste the output here?

Could you also then run:

nvidia-smi

And paste that output too?

I think it could be a CUDA GPU / PyTorch issue

steja · December 6, 2022, 4:45pm

@sanchit-gandhi
from transformers-cli env :

transformers version: 4.26.0.dev0
Platform: Linux-3.10.0-862.el7.x86_64-x86_64-with-glibc2.17
Python version: 3.10.8
Huggingface_hub version: 0.11.1
PyTorch version (GPU?): 1.13.0+cu117 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: No, notebook
Using distributed or parallel set-up in script?: multi gpu is available, but the issue persists in both single/multi modes

From nvidia-smi:

allandclive · December 6, 2022, 5:03pm

I am trying to fine-tune on a language called Luganda, which has the Common Voice id “lg”. I tried modifying the Colab notebook, but I get this error message:

ValueError: Unsupported language: luganda. Language should be one of: [‘english’, ‘chinese’, ‘german’, ‘spanish’, ‘russian’, ‘korean’, ‘french’, ‘japanese’, ‘portuguese’, ‘turkish’, ‘polish’, ‘catalan’, ‘dutch’, ‘arabic’, ‘swedish’, ‘italian’, ‘indonesian’, ‘hindi’, …

Where is the problem?

steja · December 6, 2022, 5:16pm

If you are using the Colab notebook, this worked for me (I switched off streaming to check whether I was able to load the data, but you can switch it back on).
But I don’t think the Whisper processor supports Luganda (correct me if I’m wrong)?

from datasets import interleave_datasets, load_dataset

def load_streaming_dataset(dataset_name, dataset_config_name, split, **kwargs):
    if "+" in split:
        # load multiple splits separated by the `+` symbol (streaming is switched off here)
        dataset_splits = [load_dataset(dataset_name, dataset_config_name, split=split_name, streaming=False, **kwargs) for split_name in split.split("+")]
        # interleave multiple splits to form one dataset
        interleaved_dataset = interleave_datasets(dataset_splits)
        return interleaved_dataset
    else:
        # load a single split (streaming is switched off here)
        dataset = load_dataset(dataset_name, dataset_config_name, split=split, streaming=False, **kwargs)
        return dataset

from datasets import IterableDatasetDict

raw_datasets = IterableDatasetDict()

raw_datasets["train"] = load_streaming_dataset("common_voice", "lg", split="train", use_auth_token=False)
#load_streaming_dataset("mozilla-foundation/common_voice_11_0", "es", split="train", use_auth_token=False)  # set split="train+validation" for low-resource
raw_datasets["test"] = load_streaming_dataset("common_voice", "lg", split="test", use_auth_token=False)
#load_streaming_dataset("mozilla-foundation/common_voice_11_0", "es", split="test", use_auth_token=False)

allandclive · December 6, 2022, 5:46pm

same error message. It’s unsupported

allandclive · December 7, 2022, 10:34am

Looking for ways to successfully fine tune on “Luganda” language, “lg”- common voice id. Whisper seems not to support the language. Any ideas are welcome.

sanchit-gandhi · December 7, 2022, 12:20pm

Hey @allandclive! Could you first update transformers to main?

pip uninstall transformers
pip install git+https://github.com/huggingface/transformers

This will make it easier to select languages through either the language name or language id code

My advice would be to set the language to the closest language to Luganda.

Am I right in saying that Swahili is similar to Luganda? We can try setting the language to Swahili!
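
The fallback idea can be sketched as a small lookup. Everything here is illustrative: `SUPPORTED_SUBSET` is only a handful of the language names Whisper accepts, and `FALLBACKS` and `pick_language` are hypothetical names, not part of transformers or the event scripts.

```python
# A small subset of the language names Whisper accepts, for illustration only.
SUPPORTED_SUBSET = {"english", "spanish", "hindi", "swahili"}

# Hypothetical mapping from an unsupported language to a stand-in language.
FALLBACKS = {"luganda": "swahili"}


def pick_language(name: str) -> str:
    """Return the language name to pass to Whisper, using a fallback if needed."""
    name = name.lower()
    if name in SUPPORTED_SUBSET:
        return name
    if name in FALLBACKS:
        return FALLBACKS[name]
    raise ValueError(f"Unsupported language with no fallback: {name}")


print(pick_language("Luganda"))  # swahili
```

The returned name would go into the `--language` argument, while the dataset config would stay `lg` so that the Luganda audio is still the training data.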

steja · December 7, 2022, 6:08pm

@sanchit-gandhi @reach-vb
While training on OpenSLR and evaluating on FLEURS, it looks like there is a size mismatch during evaluation.
Has anyone come across this?

-> 1527 return inner_training_loop(
   1528     args=args,
   1529     resume_from_checkpoint=resume_from_checkpoint,
   1530     trial=trial,
   1531     ignore_keys_for_eval=ignore_keys_for_eval,
   1532 )

File ~/miniconda3/envs/env_whisper/lib/python3.10/site-packages/transformers/trainer.py:1852, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1849     self.state.epoch = epoch + (step + 1) / steps_in_epoch
   1850     self.control = self.callback_handler.on_step_end(args, self.state, self.control)
-> 1852     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1853 else:
   1854     self.control = self.callback_handler.on_substep_end(args, self.state, self.control)

File ~/miniconda3/envs/env_whisper/lib/python3.10/site-packages/transformers/trainer.py:2115, in Trainer._maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
...
    return forward_call(*input, **kwargs)
  File /home/miniconda3/envs/env_whisper/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py", line 869, in forward
    hidden_states = inputs_embeds + positions
RuntimeError: The size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1

sanchit-gandhi · December 8, 2022, 11:43am

Hey @steja! This is pretty unlucky. It means that we have a sample with 504 tokens in our dataset, but the model has a max length of 448. Could you add an extra filter step to your dataset before you instantiate the Trainer:

max_label_length = model.config.max_length

def filter_labels(labels):
    """Filter label sequences longer than max length"""
    return len(labels) < max_label_length

vectorized_datasets = vectorized_datasets.filter(filter_labels, input_columns=["labels"])

This should fix the issue!
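
To see what the predicate keeps and drops, here is a toy run on plain Python lists. It is only a sketch: the limit is hard-coded to the 448 reported in the error rather than read from `model.config.max_length`, and the token ids are placeholders.

```python
MAX_LABEL_LENGTH = 448  # hard-coded for the sketch; normally model.config.max_length


def filter_labels(labels):
    """Keep only label sequences strictly shorter than the maximum length."""
    return len(labels) < MAX_LABEL_LENGTH


# Toy label sequences of different lengths (the token ids are placeholders).
examples = [[0] * 100, [0] * 447, [0] * 448, [0] * 504]

kept_lengths = [len(seq) for seq in examples if filter_labels(seq)]
print(kept_lengths)  # [100, 447]
```

Note that the comparison is strict, so a sequence of exactly 448 tokens is dropped along with the 504-token one.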

pierreguillou · December 8, 2022, 11:48am

Hi @sanchit-gandhi. Thank you for your answer but I do not see the changes in the compute_metrics() function. Can you confirm that you will do it?
