Language detection with Whisper

Jotanner · November 13, 2022, 11:53am

The original Whisper model supports dynamically detecting the language of the input audio, either by default as part of its model.transcribe() method or by doing something like this:

mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)

It looks like the Transformers implementation supports setting the language on the WhisperTokenizer/WhisperProcessor, but I’m wondering if there’s an equivalent language detection method.

I looked through the modeling code and didn’t see anything related to language detection. It looks like the language is set by force-decoding a language token at the beginning of the output, and in the OpenAI implementation the language detection method just takes a single decoding step and returns the probabilities for all the language tokens.

This makes me think it wouldn’t be hard to implement myself by taking one decoding step with the Transformers model and then pulling out the probabilities for the language tokens, but I’m wondering if there’s a better way to do this or a plan to implement an official language detection method.

Edit: Okay I kind of answered my own question while I was doing the research to ask it, and I recognize that decoding normally without forcing any language-related tokens basically means the model is doing language detection while it decodes, but I’m still wondering if there’s some specific language detection method I missed or I should just use the logits from the first decoding step.
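
To make the "single decoding step" idea concrete, here is a minimal sketch in plain Python with no Whisper dependency. Given the logits for the first decoded position, it keeps only the entries at the language-token ids and renormalizes them with a softmax. The vocabulary size, token ids and logit values are made up for illustration; with a real model they would come from the tokenizer and from one forward pass.

```python
import math
from typing import Dict, Sequence


def language_probs(logits: Sequence[float], language_ids: Dict[str, int]) -> Dict[str, float]:
    """Softmax restricted to the language-token positions of one logits vector."""
    selected = {lang: logits[idx] for lang, idx in language_ids.items()}
    # Subtract the max before exponentiating for numerical stability.
    peak = max(selected.values())
    exps = {lang: math.exp(value - peak) for lang, value in selected.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


# Toy example: a 6-entry "vocabulary" where ids 3, 4 and 5 stand in for language tokens.
# The large logit at id 0 is ignored because it is not a language token.
toy_logits = [9.0, 0.5, -1.0, 2.0, 1.0, 0.0]
toy_language_ids = {"en": 3, "zh": 4, "de": 5}
probs = language_probs(toy_logits, toy_language_ids)
best = max(probs, key=probs.get)
```

Restricting the softmax to the language positions is what turns "whatever token comes first" into a proper distribution over languages.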

Jotanner · November 14, 2022, 1:14am

In case anyone else comes looking to try and do the same thing, here’s what I implemented to do language detection:

import re
from typing import Collection, Dict, List, Optional

import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer


def detect_language(model: WhisperForConditionalGeneration, tokenizer: WhisperTokenizer, input_features,
                    possible_languages: Optional[Collection[str]] = None) -> List[Dict[str, float]]:
    # hacky, but language tokens look like <|en|> or <|haw|>, and no other special token has that shape
    language_tokens = [t for t in tokenizer.additional_special_tokens if re.fullmatch(r'<\|[a-z]{2,3}\|>', t)]
    if possible_languages is not None:
        language_tokens = [t for t in language_tokens if t[2:-2] in possible_languages]
        if len(language_tokens) < len(possible_languages):
            raise RuntimeError(f'Some languages in {possible_languages} did not have associated language tokens')

    language_token_ids = tokenizer.convert_tokens_to_ids(language_tokens)

    # one decoding step, starting from <|startoftranscript|> (id 50258 in the multilingual vocabulary)
    sot_id = tokenizer.convert_tokens_to_ids('<|startoftranscript|>')
    decoder_input_ids = torch.full((input_features.shape[0], 1), sot_id, device=input_features.device)
    with torch.no_grad():
        logits = model(input_features, decoder_input_ids=decoder_input_ids).logits
    # mask out everything except the language tokens so the softmax renormalizes over languages only
    mask = torch.ones(logits.shape[-1], dtype=torch.bool, device=logits.device)
    mask[language_token_ids] = False
    logits[:, :, mask] = -float('inf')

    output_probs = logits.softmax(dim=-1).cpu()
    # one dict per input, keyed by language token (e.g. '<|en|>')
    return [
        {
            lang: output_probs[input_idx, 0, token_id].item()
            for token_id, lang in zip(language_token_ids, language_tokens)
        }
        for input_idx in range(logits.shape[0])
    ]

aarteaga · December 14, 2022, 9:32pm

Thank you very much, this looks really interesting. Can you share an example?
I tried to execute your function but an error arises.
Do you think it is possible to create a function that translates to Spanish?
Thank you very much!

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")
input_features = {"Hola", "como", "estás"}
detect_language(model, tokenizer, input_features)

Jotanner · December 15, 2022, 2:57am

Whisper is a model for processing audio, so your input features need to be audio that has been processed by the WhisperProcessor. For me this looks something like this:

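    import torchaudio

    waveform, sample_rate = torchaudio.load(str(audio_path))
    # Whisper's feature extractor expects 16 kHz mono audio, so resample and average the channels
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000).mean(dim=0)
    input_features = processor(waveform.numpy(), sampling_rate=16000,
                               return_tensors="pt").input_features

    language = detect_language(model, tokenizer, input_features, {'en', 'zh'})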

For translating to Spanish, you’ll want to use a translation model, like MarianMT.

aarteaga · December 15, 2022, 7:36am

Clear! Thank you very much! I thought Whisper could work with text input too. I assumed the algorithm first converts the audio to text and then, in a second phase, detects the language of that text or translates it.

ankurdhuriya · April 1, 2023, 1:26pm

How can we use this when the speech is multilingual, so that the language keeps changing throughout the recording?

Jotanner · April 2, 2023, 4:25am

That’s a tricky question. You could try something like splitting the speech up into segments and generating language tokens for each segment, but unfortunately I don’t know a silver bullet solution for this.

Whisper does a decent job recognizing multilingual speech out of the box (and even better with fine-tuning) though, so you could also just try and recognize all the speech and then segment the text based on language. That might be easier.
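
The segment-by-segment idea can be sketched as follows. This is only an illustration under assumptions: the audio is a flat sequence of samples, and `detect_fn` is a hypothetical callback that returns a language label for one window (in practice it would wrap a Whisper call). The stand-in detector below exists only so the example runs without a model.

```python
from typing import Callable, List, Sequence, Tuple


def detect_per_segment(
    samples: Sequence[float],
    sample_rate: int,
    detect_fn: Callable[[Sequence[float]], str],
    segment_s: float = 10.0,
) -> List[Tuple[float, float, str]]:
    """Split audio into fixed-length windows and label each one with detect_fn."""
    step = int(segment_s * sample_rate)
    if step <= 0:
        raise ValueError("segment_s * sample_rate must be at least one sample")
    results = []
    for start in range(0, len(samples), step):
        window = samples[start:start + step]
        end = start + len(window)
        # Store (start time, end time, label) in seconds.
        results.append((start / sample_rate, end / sample_rate, detect_fn(window)))
    return results


# Stand-in detector for demonstration only: a real one would run Whisper on the window.
def fake_detect(window: Sequence[float]) -> str:
    return "en" if sum(window) >= 0 else "zh"


audio = [1.0] * 20 + [-1.0] * 15  # 35 samples at a pretend rate of 1 Hz
segments = detect_per_segment(audio, sample_rate=1, detect_fn=fake_detect, segment_s=10)
```

Shorter windows give finer time resolution but give the detector less context, so the window length is a trade-off to tune on your own data.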

soheilasoheilasoheil · April 11, 2023, 12:37pm

Is there a way to limit Whisper’s language identification choices? My audio files contain three languages. Currently, I am splitting my audio files into 3-second chunks, feeding them to Whisper, and getting the language ID for each. However, it sometimes detects another language that is not in the file at all!

So I was thinking of limiting Whisper’s choices, or somehow using Whisper’s features to do some post-processing to get more accurate results.

What I want to do is determine the exact time of each language switch. If you know another method with a finer resolution (smaller than 3 seconds), let me know.
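
For limiting the choices, the possible_languages argument of the detect_language function posted earlier in this thread restricts the candidates to a given set. For the post-processing part, one possible approach (a sketch, not something Whisper provides) is to smooth the per-chunk labels with a sliding majority vote and then read off where the label changes. The window size and the example labels below are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def smooth_labels(labels: List[str], window: int = 3) -> List[str]:
    """Replace each label by the most common label in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighborhood = labels[max(0, i - half): i + half + 1]
        counts = Counter(neighborhood)
        top = max(counts.values())
        # On a tie, keep the original label if it is among the winners.
        winners = [lab for lab, c in counts.items() if c == top]
        smoothed.append(labels[i] if labels[i] in winners else winners[0])
    return smoothed


def switch_points(labels: List[str], chunk_s: float) -> List[Tuple[float, str, str]]:
    """Return (time, old_language, new_language) for every change between chunks."""
    return [
        (i * chunk_s, labels[i - 1], labels[i])
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1]
    ]


raw = ["en", "en", "fr", "en", "en", "de", "de", "de"]  # one label per 3-second chunk
clean = smooth_labels(raw, window=3)
switches = switch_points(clean, chunk_s=3.0)
```

The isolated "fr" is voted away, and the only remaining switch is reported at the chunk boundary, so the time resolution is still limited by the chunk length.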

Megatron17 · September 7, 2023, 3:49pm

Is there a way to handle longer audio (5 minutes)? Or do I need to split it manually, run inference on each piece, and then average the results across the splits?

ReatKay · September 10, 2023, 11:36am

Use the pipeline. It automatically chunks audio longer than 30 seconds (or any lower value you set), with a configurable overlap length.

from transformers import pipeline

transpipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    chunk_length_s=30,   # 30 seconds is the maximum length
    device="cuda",
    stride_length_s=5,   # the overlapping audio length
)

transcription = transpipe("my60minuteFile.wav")

Megatron17 · September 11, 2023, 1:04am

No, I need to identify the language, not run transcription. I have a 5-minute audio file, so I need 10 inferences over it.
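
The "split, run inference, average" approach can be sketched like this. It assumes one probability dict per 30-second chunk, in the shape returned by the detect_language function posted earlier in this thread. The numbers are made up, and plain averaging is only one reasonable way to combine the chunks.

```python
from typing import Dict, List


def average_language_probs(per_chunk: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-chunk language probabilities into one distribution."""
    if not per_chunk:
        raise ValueError("need at least one chunk")
    # Collect every language that appears in any chunk.
    languages = set().union(*per_chunk)
    return {
        lang: sum(chunk.get(lang, 0.0) for chunk in per_chunk) / len(per_chunk)
        for lang in languages
    }


# Made-up probabilities for three chunks of the same file.
chunk_probs = [
    {"en": 0.9, "zh": 0.1},
    {"en": 0.6, "zh": 0.4},
    {"en": 0.3, "zh": 0.7},
]
overall = average_language_probs(chunk_probs)
top_language = max(overall, key=overall.get)
```

Averaging probabilities rather than counting per-chunk winners lets a confident chunk outweigh several uncertain ones.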

ReatKay · September 11, 2023, 10:40am

Sorry, I was just reading your reply, from that it wasn’t clear - sorry again.

loretoparisi · June 5, 2024, 5:43pm

Doing something like this:

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")

input_features = processor(audio, return_tensors="pt", sampling_rate=16000).input_features
language_name = detect_language_tokens(model, tokenizer, input_features, {'en', 'zh'})

I get the error

AttributeError: 'Tensor' object has no attribute 'additional_special_tokens'

loretoparisi · June 5, 2024, 6:50pm

I solved it by using max_new_tokens=1 in model.generate:

        model_path = 'openai/whisper-large-v3'

        # Load the pre-trained Whisper model and processor from Hugging Face
        processor = WhisperProcessor.from_pretrained(model_path)
        model = WhisperForConditionalGeneration.from_pretrained(model_path)
        tokenizer = WhisperTokenizer.from_pretrained(model_path)

        self.logger.info(f'loading audio from {media_path}')
        audio = self.load_audio(file=media_path)
        self.logger.info(f'audio size {audio.size}')
        
        # Process the audio to get the input values
        input_features = processor(audio, return_tensors="pt", sampling_rate=16000).input_features
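        # generate() returns [<|startoftranscript|>, <language token>], so index 1 is the language token
        lang_token = model.generate(input_features, max_new_tokens=1)[0, 1]
        language_code = tokenizer.decode(lang_token)  # e.g. <|en|>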
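
The decoded value is the raw special token (for example <|en|>), so a small helper can strip the markers. This is a sketch using only the standard library. The function name is hypothetical, and the pattern assumes language codes are two or three lowercase letters.

```python
import re
from typing import Optional

# Matches tokens such as <|en|> or <|haw|>, capturing the code between the markers.
LANGUAGE_TOKEN = re.compile(r"<\|([a-z]{2,3})\|>")


def language_code_from_token(token: str) -> Optional[str]:
    """Return 'en' for '<|en|>', or None if the string is not a language token."""
    match = LANGUAGE_TOKEN.fullmatch(token.strip())
    return match.group(1) if match else None


code = language_code_from_token("<|en|>")
not_a_language = language_code_from_token("<|transcribe|>")
```

Returning None for anything else makes it easy to notice when the first generated token was not a language token at all.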

SuiGio · August 30, 2024, 10:34am

That’s great. Is it possible to add this to a pipeline?

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=10,  # Reduce chunk length to manage memory usage
    batch_size=8,  # Reduce batch size to manage memory usage
    return_timestamps=True,
    torch_dtype=torch_dtype,  # Set the appropriate dtype
    device=device,  # Set the appropriate device (GPU or CPU)
)

Ideally I want the object returned by pipe to have either a single “language” field or, even better, one for each timestamped row.
How would you go about it? I see you’re using the native model, not a pipeline, and I’m not sure how to extend the pipeline with the additional output I want.
Thanks

SuiGio · August 30, 2024, 11:48am

It seems the simple addition in pipeline

return_language=True

does the trick.

dmavroeidis · February 25, 2025, 1:41pm

This solution is the simplest one. No other fuss!

It provides the language field in each of the individual chunks!
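
Once each chunk carries a language field, the per-chunk values can be summarized. The sketch below assumes each chunk is a dict with "language", "timestamp" and "text" keys. That shape and the language values are assumptions for illustration, so check them against the actual pipeline output. The example list is hand-written rather than real model output.

```python
from collections import defaultdict
from typing import Dict, List


def seconds_per_language(chunks: List[dict]) -> Dict[str, float]:
    """Sum the duration of the chunks attributed to each language."""
    totals: Dict[str, float] = defaultdict(float)
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start is None or end is None:
            continue  # skip chunks with an open-ended timestamp
        totals[chunk["language"]] += end - start
    return dict(totals)


# Hand-written stand-in for the "chunks" list of a pipeline result (assumed shape).
example_chunks = [
    {"language": "english", "timestamp": (0.0, 4.0), "text": " Hello there."},
    {"language": "spanish", "timestamp": (4.0, 7.5), "text": " Hola, ¿cómo estás?"},
    {"language": "english", "timestamp": (7.5, 9.0), "text": " Fine, thanks."},
]
totals = seconds_per_language(example_chunks)
```

The language with the largest total is a reasonable single label for the whole file when one is needed.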
