Retrieving avg_logprob and other metrics for segments using whisper

import json
import transformers as tf


pipe = tf.pipeline(
    "automatic-speech-recognition",
    model=r"/Users/maor/.cache/huggingface/hub/models--ivrit-ai--whisper-large-v3/snapshots/766847c9795b3b5cc0d42f8476199c711d5cee21",
    return_timestamps=True,
    generate_kwargs={
        "language": "he",
        "max_new_tokens": 445,
        "temperature": [0.0, 0.2],
        "logprob_threshold": -1,
    }
)

result = pipe("r0_10.wav")

with open("result.txt", "w") as f:
    json.dump(result, f, ensure_ascii=False)

Hello. How can I get metrics like avg_logprob and compression_ratio (as in openai-whisper) included in the output segments?

I tried adding return_segments=True and return_dict_in_generate=True to generate_kwargs, but got: ‘dict’ object has no attribute ‘dtype’


Seems like a bug and/or compatibility issue…?


You won’t get avg_logprob or compression_ratio from the HF automatic-speech-recognition pipeline. The error AttributeError: 'dict' object has no attribute 'dtype' comes from a pipeline bug when return_segments=True and timestamps are enabled; outputs["tokens"] becomes a dict and the postprocess path tries to read .dtype. Also, return_dict_in_generate=True is not supported in the ASR pipeline and triggers a different postprocess failure. Use faster-whisper, or call the model’s .generate(...) yourself and compute the metrics. (Hugging Face)

Background in one place

  • avg_logprob: mean log-probability of the tokens chosen for a segment. Whisper treats a segment as failed if it is below −1.0.
  • compression_ratio: len(text_bytes) / len(zlib.compress(text_bytes)). Degenerate, repetitive output compresses very well, so Whisper treats a segment as failed if the ratio is above 2.4.
    These thresholds drive “temperature fallback” (retry at higher temperatures when segments look bad); a short sketch of both formulas follows. (OpenAI)
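
A minimal sketch of both formulas, matching openai-whisper’s definitions (the function names here are illustrative, not a library API):

import zlib

def compression_ratio(text: str) -> float:
    # Whisper-style compression ratio: uncompressed bytes / zlib-compressed bytes.
    # Degenerate, repetitive output compresses well, so a ratio > 2.4 flags a bad segment.
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))

def avg_logprob(token_logprobs: list[float]) -> float:
    # Mean log-probability of the tokens actually generated for the segment.
    # Whisper treats the segment as failed when this drops below -1.0.
    return sum(token_logprobs) / max(1, len(token_logprobs))

print(compression_ratio("abc abc abc abc abc abc abc abc"))  # repetitive text -> high ratio
print(avg_logprob([-0.1, -0.3, -0.2]))                       # about -0.2 -> confident segment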

Causes

  1. Pipeline dtype bug with segments+timestamps. When you ask for return_segments=True and timestamps, outputs["tokens"] is a dict, not a tensor; the pipeline checks .dtype and crashes. (GitHub)
  2. Pipeline doesn’t support Whisper-specific diagnostics. no_speech_prob, avg_logprob, compression_ratio are considered model-specific and aren’t exposed by the pipeline. (Hugging Face)
  3. return_dict_in_generate=True inside the ASR pipeline. Not supported; postprocessing expects arrays/tensors and hits errors. (GitHub)

Solutions

A) Get the metrics today with faster-whisper ✅

# pip install faster-whisper
# Docs/Repo: https://github.com/SYSTRAN/faster-whisper
from faster_whisper import WhisperModel
import json

model = WhisperModel("large-v3")  # device/compute_type can be set if needed
segments, info = model.transcribe(
    "r0_10.wav",
    language="he",                   # Hebrew
    temperature=[0.0, 0.2],          # fallback temps
    log_prob_threshold=-1.0,         # note: name is log_prob_threshold here
    compression_ratio_threshold=2.4, # Whisper paper default
)

out = [{
    "start": s.start, "end": s.end, "text": s.text,
    "avg_logprob": s.avg_logprob,
    "compression_ratio": s.compression_ratio,
    "no_speech_prob": s.no_speech_prob,
    "temperature": s.temperature,
} for s in segments]

with open("result_with_metrics.json", "w") as f:
    json.dump(out, f, ensure_ascii=False, indent=2)
# faster-whisper Segment exposes avg_logprob/compression_ratio by design.
# https://github.com/SYSTRAN/faster-whisper

faster-whisper mirrors Whisper’s thresholds and exposes the per-segment fields you want. (GitHub)

B) Stay on HF. Bypass the pipeline and compute metrics yourself

# HF Whisper docs: https://huggingface.co/docs/transformers/en/model_doc/whisper
# 'avg_logprob' = mean log softmax over generated ids.
# 'compression_ratio' = len(text_bytes) / len(zlib.compress(text_bytes)), as in openai-whisper.

import zlib
import torch.nn.functional as F
import soundfile as sf
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-large-v3"  # or your local path
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).eval()

audio, sr = sf.read("r0_10.wav")
if audio.ndim > 1:                     # downmix stereo to mono
    audio = audio.mean(axis=1)
if sr != 16000:                        # Whisper expects 16 kHz input
    import librosa
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Force Hebrew transcription to avoid language flips.
# https://huggingface.co/openai/whisper-tiny (example of forced_decoder_ids)
forced = processor.get_decoder_prompt_ids(language="he", task="transcribe")

out = model.generate(
    **inputs,
    forced_decoder_ids=forced,
    max_new_tokens=445,
    do_sample=False,               # greedy decoding (equivalent to temperature 0.0)
    return_dict_in_generate=True,
    output_scores=True,
)

scores = out.scores                          # list of [B,V] logits
seq = out.sequences[0]                       # token ids
gen_ids = seq[-len(scores):]                 # generated part only

logps = [F.log_softmax(scores[t], dim=-1)[0, gen_ids[t]].item()
         for t in range(len(scores))]
avg_logprob = sum(logps) / max(1, len(logps))

text = processor.tokenizer.decode(seq, skip_special_tokens=True)
text_bytes = text.encode("utf-8")
compression_ratio = len(text_bytes) / max(1, len(zlib.compress(text_bytes)))

print({"avg_logprob": avg_logprob, "compression_ratio": compression_ratio, "text": text})
# For per-segment metrics: split your audio into the same chunks you output,
# run this block per chunk, then attach the metrics to each chunk in JSON.

This avoids the pipeline’s postprocess path and gives you the exact numbers. Definitions and thresholds match the paper. (Hugging Face)

C) Hybrid when you want the pipeline JSON

  1. Run the pipeline with return_timestamps=True but without return_dict_in_generate and without return_segments=True until the bug is fixed.
  2. For each returned chunk’s time span, slice audio and call model.generate(..., output_scores=True) as in B.
  3. Compute and merge avg_logprob and compression_ratio back into your pipeline chunks.
    This preserves the HF output format and adds the diagnostics; a rough sketch of the loop follows. (Hugging Face)
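
A rough sketch of that loop, assuming a helper metrics_for_chunk() that wraps the generate-and-compute code from B (the helper name is mine, not an HF API):

import soundfile as sf
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    return_timestamps=True,
    generate_kwargs={"language": "he", "max_new_tokens": 445},
)

audio, sr = sf.read("r0_10.wav")
if audio.ndim > 1:                          # downmix stereo to mono
    audio = audio.mean(axis=1)

result = pipe("r0_10.wav")                  # normal pipeline output with "chunks"

for chunk in result["chunks"]:
    start, end = chunk["timestamp"]         # seconds; end can be None on the last chunk
    end = len(audio) / sr if end is None else end
    audio_slice = audio[int(start * sr):int(end * sr)]
    # metrics_for_chunk: your wrapper around model.generate(..., output_scores=True)
    # plus the avg_logprob / compression_ratio computation from B.
    chunk.update(metrics_for_chunk(audio_slice, sr))

print(result["chunks"])                     # pipeline JSON, now with diagnostics attached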

Version and configuration tips

  • If you see “empty segments” with recent Transformers releases, pin to a known good version and test. Users reported regressions around 4.47.x when timestamps are enabled. (GitHub)
  • Parameter names differ: HF uses logprob_threshold; faster-whisper uses log_prob_threshold. Update kwargs accordingly. (GitHub)
  • Temperature fallback works only if you pass a list of temperatures (e.g., [0.0, 0.2, 0.4]), and thresholds decide when to retry. (GitHub)
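
If you use the model-level generate from B, you can reproduce the fallback loop yourself. A minimal sketch, where transcribe_once stands in for your own generate-plus-metrics function (not an HF API):

def transcribe_with_fallback(transcribe_once,
                             temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                             logprob_threshold=-1.0,
                             compression_ratio_threshold=2.4):
    # Retry at increasing temperatures until a segment passes both thresholds,
    # mirroring openai-whisper's fallback logic.
    # (At temperature > 0 you would pass do_sample=True, temperature=t to model.generate.)
    result = None
    for t in temperatures:
        result = transcribe_once(temperature=t)  # -> {"text", "avg_logprob", "compression_ratio"}
        if (result["avg_logprob"] >= logprob_threshold
                and result["compression_ratio"] <= compression_ratio_threshold):
            break                                # segment looks good, stop retrying
    return result                                # last attempt if every temperature failed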

Pitfalls to avoid

  • Don’t pass return_dict_in_generate=True through the ASR pipeline. Use the model API instead. (GitHub)
  • The dtype crash with return_segments=True + timestamps is tracked; avoid that combination until patched. (GitHub)
  • If you need fixed language (Hebrew), set forced_decoder_ids = processor.get_decoder_prompt_ids(language="he", task="transcribe") on the model generate call. (Hugging Face)
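
Recent Transformers releases also accept the language and task directly in Whisper’s generate, which avoids forced_decoder_ids entirely; a minimal variant of the call from B, assuming such a version:

out = model.generate(
    **inputs,
    language="he",                 # passed straight to Whisper's generate
    task="transcribe",
    max_new_tokens=445,
    return_dict_in_generate=True,
    output_scores=True,
)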

Short, curated references

GitHub issues

  • Pipeline crash with return_segments=True + timestamps (dict has no .dtype). (GitHub)
  • Pipeline cannot return scores when return_dict_in_generate=True. (GitHub)
  • Empty segments with timestamps in newer releases. (GitHub)

HF docs / model cards

  • Transformers Whisper docs and the openai/whisper-large-v3 model card (generate options, forced_decoder_ids). (Hugging Face)

Hugging Face forums

  • Whisper-specific metrics not exposed by the ASR pipeline; suggested workaround is model-level generate. (Hugging Face)
  • Getting scores from Whisper with output_scores=True. (Hugging Face Forums)

Original definitions

  • Whisper paper with thresholds −1.0 and 2.4 for fallback. (OpenAI)

Alternative implementation

  • faster-whisper repo and Segment fields. (GitHub)

Bottom line: HF’s ASR pipeline won’t emit avg_logprob/compression_ratio and currently crashes with return_segments=True+timestamps. Switch to faster-whisper for built-in metrics, or call WhisperForConditionalGeneration.generate(..., output_scores=True) per chunk and compute the two values yourself. (Hugging Face)