This is a pipeline bug plus a compatibility limitation.
You won't get avg_logprob or compression_ratio from the HF automatic-speech-recognition pipeline. The error AttributeError: 'dict' object has no attribute 'dtype' comes from a pipeline bug when return_segments=True and timestamps are enabled; outputs["tokens"] becomes a dict and the postprocess path tries to read .dtype. Also, return_dict_in_generate=True is not supported in the ASR pipeline and triggers a different postprocess failure. Use faster-whisper, or call the model's .generate(...) yourself and compute the metrics. (Hugging Face)
Background in one place
avg_logprob: mean log-probability of the chosen tokens for a segment. Whisper treats a segment as failed if it is < -1.0.
compression_ratio: len(gzip(text)) / len(text). Whisper treats a segment as failed if it is > 2.4.
These thresholds drive "temperature fallback": when a segment looks bad, decoding is retried at higher temperatures. (OpenAI)
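As a minimal sketch of both checks, using only the two definitions above (the function names are mine, not from any library):

import gzip

def compression_ratio(text: str) -> float:
    # Whisper's definition: gzip-compressed size over raw UTF-8 size.
    data = text.encode("utf-8")
    return len(gzip.compress(data)) / max(1, len(data))

def segment_failed(avg_logprob: float, text: str) -> bool:
    # A segment counts as failed, and triggers fallback, when either check trips.
    return avg_logprob < -1.0 or compression_ratio(text) > 2.4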
Causes
- Pipeline dtype bug with segments + timestamps. When you ask for return_segments=True and timestamps, outputs["tokens"] is a dict, not a tensor; the pipeline checks .dtype and crashes. (GitHub)
- Pipeline doesn't support Whisper-specific diagnostics. no_speech_prob, avg_logprob, and compression_ratio are considered model-specific and aren't exposed by the pipeline. (Hugging Face)
- return_dict_in_generate=True inside the ASR pipeline. Not supported; postprocessing expects arrays/tensors and fails. (GitHub)
Solutions
A) Get the metrics today with faster-whisper
# pip install faster-whisper
# Docs/Repo: https://github.com/SYSTRAN/faster-whisper
from faster_whisper import WhisperModel
import json

model = WhisperModel("large-v3")  # device/compute_type can be set if needed

segments, info = model.transcribe(
    "r0_10.wav",
    language="he",                     # Hebrew
    temperature=[0.0, 0.2],           # fallback temps
    log_prob_threshold=-1.0,          # note: the name is log_prob_threshold here
    compression_ratio_threshold=2.4,  # Whisper paper default
)

out = [{
    "start": s.start, "end": s.end, "text": s.text,
    "avg_logprob": s.avg_logprob,
    "compression_ratio": s.compression_ratio,
    "no_speech_prob": s.no_speech_prob,
    "temperature": s.temperature,
} for s in segments]

with open("result_with_metrics.json", "w") as f:
    json.dump(out, f, ensure_ascii=False, indent=2)

# faster-whisper's Segment exposes avg_logprob/compression_ratio by design.
# https://github.com/SYSTRAN/faster-whisper
faster-whisper mirrors Whisper's thresholds and exposes the per-segment fields you want. (GitHub)
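If you also want to flag unreliable segments yourself (for example, to re-run them later), the thresholds from the background section apply directly to the JSON built above:

bad = [s for s in out if s["avg_logprob"] < -1.0 or s["compression_ratio"] > 2.4]
print(f"{len(bad)} of {len(out)} segments look unreliable")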
B) Stay on HF. Bypass the pipeline and compute metrics yourself
# HF Whisper docs: https://huggingface.co/docs/transformers/en/model_doc/whisper
# 'avg_logprob'       = mean log-softmax over the generated ids.
# 'compression_ratio' = len(gzip(text)) / len(text).
import gzip
import soundfile as sf
import torch.nn.functional as F
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-large-v3"  # or your local path
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).eval()

audio, sr = sf.read("r0_10.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix stereo to mono
if sr != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Force Hebrew transcription to avoid language flips.
# https://huggingface.co/openai/whisper-tiny (example of forced_decoder_ids)
forced = processor.get_decoder_prompt_ids(language="he", task="transcribe")

out = model.generate(
    **inputs,
    forced_decoder_ids=forced,
    max_new_tokens=445,
    temperature=0.0,
    return_dict_in_generate=True,
    output_scores=True,
)

scores = out.scores           # list of [B, V] logits, one per generated step
seq = out.sequences[0]        # token ids (prompt + generated)
gen_ids = seq[-len(scores):]  # generated part only
logps = [F.log_softmax(scores[t], dim=-1)[0, gen_ids[t]].item()
         for t in range(len(scores))]
avg_logprob = sum(logps) / max(1, len(logps))

text = processor.tokenizer.decode(seq, skip_special_tokens=True)
gz = gzip.compress(text.encode("utf-8"))
compression_ratio = len(gz) / max(1, len(text.encode("utf-8")))

print({"avg_logprob": avg_logprob, "compression_ratio": compression_ratio, "text": text})

# For per-segment metrics: split your audio into the same chunks you output,
# run this block per chunk, then attach the metrics to each chunk in JSON
# (see the hybrid sketch after section C).
This avoids the pipeline's postprocess path and gives you the exact numbers. The definitions and thresholds match the paper. (Hugging Face)
C) Hybrid when you want the pipeline JSON
- Run the pipeline with return_timestamps=True, but without return_dict_in_generate and without return_segments=True, until the bug is fixed.
- For each returned chunk's time span, slice the audio and call model.generate(..., output_scores=True) as in B.
- Compute avg_logprob and compression_ratio per chunk and merge them back into your pipeline chunks, as in the sketch below.
This preserves HF output format and adds diagnostics. (Hugging Face)
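A hedged sketch of that hybrid, reusing model, processor, forced, gzip, F, and audio from B. It assumes pipeline chunks shaped like {"timestamp": (start, end), "text": ...}, which is what return_timestamps=True produces; metrics_for_span is a hypothetical helper, not a library function:

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model=model,
               tokenizer=processor.tokenizer,
               feature_extractor=processor.feature_extractor)
result = asr("r0_10.wav", return_timestamps=True,
             generate_kwargs={"language": "he", "task": "transcribe"})

def metrics_for_span(start, end, sr=16000):
    # Slice the original waveform to the chunk's span and re-score it.
    chunk = audio[int(start * sr):int(end * sr)]
    inputs = processor(chunk, sampling_rate=sr, return_tensors="pt")
    out = model.generate(**inputs, forced_decoder_ids=forced,
                         return_dict_in_generate=True, output_scores=True)
    seq, scores = out.sequences[0], out.scores
    gen_ids = seq[-len(scores):]
    logps = [F.log_softmax(scores[t], dim=-1)[0, gen_ids[t]].item()
             for t in range(len(scores))]
    text = processor.tokenizer.decode(seq, skip_special_tokens=True)
    data = text.encode("utf-8")
    return {"avg_logprob": sum(logps) / max(1, len(logps)),
            "compression_ratio": len(gzip.compress(data)) / max(1, len(data))}

for chunk in result["chunks"]:
    start, end = chunk["timestamp"]  # note: the final end can be None; guard in real code
    chunk.update(metrics_for_span(start, end))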
Version and configuration tips
- If you see "empty segments" with recent Transformers releases, pin to a known-good version and test. Users reported regressions around 4.47.x when timestamps are enabled. (GitHub)
- Parameter names differ: HF uses logprob_threshold; faster-whisper uses log_prob_threshold. Update kwargs accordingly; see the side-by-side below. (GitHub)
- Temperature fallback works only if you pass a list of temperatures (e.g., [0.0, 0.2, 0.4]); the thresholds decide when to retry. (GitHub)
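For reference, the same fallback configuration in both APIs. The HF half assumes a Transformers version that supports Whisper's long-form temperature fallback (fallback only applies to long-form inputs there):

# HF model-level generate (the WhisperForConditionalGeneration from B):
# note logprob_threshold and a tuple of temperatures.
model.generate(**inputs,
               temperature=(0.0, 0.2, 0.4),
               logprob_threshold=-1.0,
               compression_ratio_threshold=2.4)

# faster-whisper (the WhisperModel from A): same knobs,
# but log_prob_threshold with an extra underscore.
model.transcribe("r0_10.wav",
                 temperature=[0.0, 0.2, 0.4],
                 log_prob_threshold=-1.0,
                 compression_ratio_threshold=2.4)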
Pitfalls to avoid
- Don't pass return_dict_in_generate=True through the ASR pipeline. Use the model API instead. (GitHub)
- The dtype crash with return_segments=True + timestamps is tracked; avoid that combination until patched. (GitHub)
- If you need a fixed language (Hebrew), set forced_decoder_ids = processor.get_decoder_prompt_ids(language="he", task="transcribe") on the generate call. (Hugging Face)
Short, curated references
GitHub issues
- Pipeline crash with return_segments=True + timestamps (dict has no .dtype). (GitHub)
- Pipeline cannot return scores when return_dict_in_generate=True. (GitHub)
- Empty segments with timestamps in newer releases. (GitHub)
HF docs / model cards
- Transformers Whisper docs: https://huggingface.co/docs/transformers/en/model_doc/whisper
- Model card with a forced_decoder_ids example: https://huggingface.co/openai/whisper-tiny
Hugging Face forums
- Whisper-specific metrics are not exposed by the ASR pipeline; the suggested workaround is model-level generate. (Hugging Face)
- Getting scores from Whisper with output_scores=True. (Hugging Face Forums)
Original definitions
- Whisper paper with the -1.0 and 2.4 fallback thresholds. (OpenAI)
Alternative implementation
- faster-whisper repo and Segment fields. (GitHub)
Bottom line: HF's ASR pipeline won't emit avg_logprob/compression_ratio and currently crashes with return_segments=True + timestamps. Switch to faster-whisper for built-in metrics, or call WhisperForConditionalGeneration.generate(..., output_scores=True) per chunk and compute the two values yourself. (Hugging Face)