Fine-tuning Whisper-large-v3 for child reading assessment with numerals and proper names

Hi everyone,

I’m working on a reading assessment product for children.

Current setup:

  • a child reads a known passage for about 1 minute
  • our system then counts how many words were read correctly
  • right now we use whisper-1 as a baseline
  • we now want to move to an open model and fine-tune Whisper-large-v3 on our own infrastructure

This is not a generic ASR task:

  • we always know the reference text in advance
  • our main metric is correct-word-count accuracy against the reference passage

The main cases we want to improve through fine-tuning are:

  • numerals / spoken-written forms, for example “three” vs “3”
  • proper names and other rare words
  • child reading speech in general

I’d like advice specifically on the fine-tuning strategy for this type of task.

My questions:

  1. For this use case, what training targets would you recommend for fine-tuning: verbatim spoken transcripts, normalized transcripts, or transcripts matching the reference text format?
  2. How much data is usually needed to see meaningful improvement when fine-tuning Whisper-large-v3 for child reading speech?
  3. What data mix would you recommend for training:
    • general child speech
    • child reading audio
    • oversampled examples with numerals
    • oversampled examples with proper names / rare words
  4. Would you start with LoRA or full fine-tuning for this kind of adaptation?
  5. If the main goal is to improve numerals and proper names, is it better to do one fine-tuning run on all data, or a staged approach:
    • first domain adaptation on child speech
    • then additional fine-tuning on hard cases like numerals and proper names
  6. Has anyone here fine-tuned Whisper-large-v3 specifically for child speech or reading assessment? If so, what setup worked best for you?

Planned stack:

  • Transformers
  • PEFT / LoRA
  • Accelerate
  • base model: openai/whisper-large-v3

I’d really appreciate practical advice on data volume, dataset composition, and fine-tuning strategy for this specific use case.

Thanks!


Great, I would like to contribute.


For now, here is what I’ve gathered from publicly available information:


Treat this as passage-aware child read-speech scoring, not generic ASR. The closest public work shows that the best systems for this family of tasks do not rely on raw transcription alone. They combine adapted ASR with prompt/context use, alignment, and scoring logic built around the known text. Child-speech studies also show that even strong foundation models still need child-specific adaptation, especially for younger speakers and noisy real-world recordings. (arXiv)

My overall recommendation

Use Whisper-large-v3 as a fidelity-first ASR model, then keep numeral normalization and correct-word counting against the known passage outside the model. In other words:

  • fine-tune the model to preserve what the child actually said
  • normalize three ↔ 3 and similar cases in a separate canonicalization layer
  • align ASR output to the known passage and compute the score there
  • use passage-specific prompting for rare words and names, but do not make the full passage text the main training target. (arXiv)

That separation is the highest-leverage design choice for your case.


1. Training targets: verbatim, normalized, or reference-format?

Recommendation

Use lightly normalized verbatim transcripts as the primary fine-tuning target. That means:

  • preserve substitutions, insertions, deletions, self-repairs, and repetitions that matter to reading assessment
  • normalize only superficial formatting such as casing, punctuation, whitespace, and maybe a small set of annotation conventions
  • do not rewrite the label to match the passage text when the child read something else.

Why not “transcripts matching the reference text format” as the main target?

Because that trains the model to correct the child instead of transcribing the child. The recent reading-mistake-detection paper found that even inference-time prompting with the read text can hurt or behave unpredictably, and for Whisper large-v3 they explicitly reported more hallucinations than large-v2 in some prompt settings. In parallel, the error-preserving ASR paper argues that preserving learner errors is necessary because cleaned-up ASR makes downstream feedback impossible. (arXiv)

What I would do in practice

Keep three representations:

  1. Training transcript: lightly normalized verbatim speech
  2. Canonical scoring transcript: normalize only what your rubric says is equivalent
  3. Reference passage text: the exact target text used for alignment and correct-word counting

This keeps the ASR model honest and moves rubric-specific equivalences into the scoring layer where they belong.
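
If it helps, here is one way to keep the three representations separate in your data schema. This is a minimal sketch; the field names are my own, not from any of the papers above.

```python
from dataclasses import dataclass

@dataclass
class ReadingItem:
    """One assessment recording with all three text layers kept separate."""
    audio_path: str   # the child's recording
    verbatim: str     # training transcript: lightly normalized verbatim
    canonical: str    # scoring transcript: rubric-level equivalences applied
    passage: str      # exact reference text for alignment and word counting
```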

Numerals

For training labels, I would usually use the spoken lexical form. If the child says “three,” the label is three, not 3. The numeric-expression literature is explicit that the “correct” written form depends on context, such as 1945 versus 19:45, so numeric rendering is often an inverse text normalization problem rather than a pure acoustic-recognition problem. (arXiv)

So for your scorer, normalize both hypothesis and passage into the same comparison space (a sketch follows this list):

  • 3 ↔ three
  • 12 ↔ twelve
  • year-style and date-style variants when relevant
  • any product-specific numeral rules you use. (arXiv)
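
A minimal sketch of that canonicalization layer, assuming a simple token-level mapping. A real one would also need larger numbers, ordinals, dates, and your product-specific rules:

```python
import re

# Illustrative coverage only; extend with larger numbers, ordinals,
# dates, and product-specific rules.
_NUM_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10", "eleven": "11", "twelve": "12",
}

def canonicalize(text: str) -> list[str]:
    """Map text into the shared comparison space: lowercase, strip
    punctuation, render number words as digits."""
    tokens = re.findall(r"[\w']+", text.lower())
    return [_NUM_WORDS.get(tok, tok) for tok in tokens]

# "three" and "3" now compare as equal, without touching the ASR labels.
assert canonicalize("The three pigs") == canonicalize("The 3 pigs")
```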

Proper names and rare words

For names, I would keep the main ASR target fidelity-first, then add contextual biasing at inference. The Whisper docs explicitly say prompt_ids can be used to bias toward custom vocabularies and proper nouns, and CB-Whisper exists precisely because Whisper struggles with rare named entities and can benefit from contextual biasing before decoding. (Hugging Face)
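
Here is a sketch of that biasing path using the documented prompt_ids mechanism. The names in the prompt string are placeholders, and the audio is dummy silence standing in for a real recording:

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Placeholder waveform; use the child's 16 kHz recording in practice.
audio = np.zeros(16000 * 5, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Bias decoding toward names that actually occur in this passage.
prompt_ids = processor.get_prompt_ids("Aisha, Bodhi, Zephyrville", return_tensors="pt")

predicted_ids = model.generate(
    inputs.input_features,
    prompt_ids=prompt_ids,
    language="en",
    task="transcribe",
)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```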

So the core answer to question 1 is:

  • Primary target: lightly normalized verbatim
  • Not recommended as primary target: passage-matching labels
  • Use separately: canonical scoring normalization and passage-aware alignment.

2. How much data is usually needed?

There is no universal threshold, but the public evidence supports a practical planning range.

My planning range

  • 10–20 hours of well-matched child read speech: enough to start seeing real gains
  • 30–50 hours: much more stable improvements
  • 80+ hours: noticeably better long-tail coverage, error preservation, and robustness to messy cases.

Why I say that

A recent Whisper-large-v3 child-speech study fine-tuned on an initial 16-hour in-domain child set and then improved further by extending it to 28 hours. Their best model reduced median WER from 52.7% for the base model to 21.2%, and the 28-hour version outperformed the 16-hour one.

Another relevant paper built an 85-hour corpus of young learner speech specifically to preserve speaker errors and reported that their fine-tuned model outperformed stronger generic baselines on error preservation. That is not the same task as oral reading, but it is directly relevant to your need to keep mistakes visible in the transcript.

The broader child-ASR literature also shows that child-specific corpora matter. The benchmark paper uses MyST and OGI specifically because child speech needs separate evaluation and adaptation. (arXiv)

Practical interpretation

If you already have production audio from your app, I would treat 10–20 hours of clean, matched, rubric-aligned child reading data as the first serious milestone. That is more valuable than a much larger pile of mismatched adult speech. The Estonian results and the benchmark both point in that direction.


3. What data mix would I use?

My recommended starting mix

Assuming you already have some real product data, I would start here (a sampling sketch follows the list):

  • 60–70% matched child reading audio from your task
  • 15–25% broader child speech
  • 10–15% hard-case slices with numerals, proper names, rare words, and common reading disfluencies
  • 5–10% synthetic or semi-synthetic support data for targeted coverage gaps.
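
One way to realize that mix is probability-weighted sampling with datasets.interleave_datasets. The dataset names below are placeholders for your own prepared splits, and the probabilities mirror the suggested mix:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset names; substitute your own prepared splits.
reading = load_dataset("our-org/child-reading", split="train")
child   = load_dataset("our-org/child-speech", split="train")
hard    = load_dataset("our-org/hard-cases", split="train")   # numerals, names, rare words
synth   = load_dataset("our-org/synthetic-support", split="train")

train = interleave_datasets(
    [reading, child, hard, synth],
    probabilities=[0.65, 0.20, 0.10, 0.05],  # tune against your dev slices
    seed=42,
    stopping_strategy="all_exhausted",       # keep sampling until every source is seen
)
```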

Why child reading should dominate

Your task is not “recognize children talking.” It is “recognize children reading a known passage, then score words against that passage.” The mixed fine-tuning paper is important here: when exact in-domain read-child data was not available, mixed fine-tuning across partially matched datasets beat single-source fine-tuning, and a 50/50 mixture of “childrenized” read speech and spontaneous child speech gave the best read-child results in that setup. That implies matched speaking style matters a lot, not just age match.

Because you do have the real task format, I would push the mix even harder toward matched child reading than that paper did.

What each bucket is for

Matched child reading audio
This is the highest-value bucket. It teaches the model your exact acoustics, timing, pause patterns, teacher-interruption patterns, and reading-error distribution. The Estonian large-v3 paper is a good reminder that in-domain child data can drastically improve behavior in realistic settings.

General child speech
This is the second-most useful bucket. It helps with child acoustics and articulation variability, but it does not fully teach the reading-task error pattern. The benchmark paper’s separate read and spontaneous child corpora show why both domains matter. (arXiv)

Oversampled numerals
I would include them, but mainly to improve recognition under child acoustics, not to force one final written rendering. The numeric-expression paper supports treating formatting as its own problem and also shows that synthetic adaptation data can help. (arXiv)

Oversampled proper names and rare words
Yes. These should be overrepresented in training and also handled at inference with passage-specific prompting. Rare names are a known Whisper weak point, and CB-Whisper is dedicated to that exact issue. (ACL Anthology)

My sampling rule

I would not just “include” hard cases. I would make them overrepresented in training batches and, as sketched below, create separate dev slices for:

  • numerals
  • proper names
  • rare words
  • youngest readers
  • insertion-heavy reads
  • self-corrections and repetitions
  • overlapping adult prompts.

That split is consistent with the reading-assessment work, the disfluency paper, and the Estonian large-v3 findings about adult overlap. (arXiv)
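
A sketch of per-slice evaluation, assuming each dev example carries a slice tag and using the evaluate library’s WER metric. The row schema is my own assumption:

```python
from collections import defaultdict
import evaluate

wer_metric = evaluate.load("wer")

def per_slice_wer(rows):
    """rows: dicts with 'slice', 'reference', 'hypothesis' keys (assumed schema)."""
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        refs, hyps = grouped[row["slice"]]
        refs.append(row["reference"])
        hyps.append(row["hypothesis"])
    return {
        name: wer_metric.compute(predictions=hyps, references=refs)
        for name, (refs, hyps) in grouped.items()
    }

# e.g. per_slice_wer(dev_rows) -> {"numerals": 0.18, "proper_names": 0.27, ...}
```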


4. LoRA or full fine-tuning?

My answer

Start with PEFT, but do not assume vanilla LoRA is the best PEFT method for Whisper on child speech.

Why

The strongest public benchmark for child ASR found:

  • LoRA underperformed full fine-tuning in their Whisper experiments
  • adapter tuning performed much better
  • for Whisper-large-v3 on MyST-test, adapter tuning matched full fine-tuning
  • the gap between PEFT and full fine-tuning got smaller as model size increased. (arXiv)

That makes the strategy fairly clear:

  • if you want fastest iteration: start with PEFT
  • if PEFT means only vanilla LoRA in your stack: start there, but treat it as a baseline
  • if you can compare LoRA vs adapters, do that
  • only move to full fine-tuning if PEFT plateaus below your target correct-word-count accuracy. (arXiv)

Practical recommendation

Given your stack, I would do this order:

  1. LoRA baseline (sketch after this list)
  2. adapter-style PEFT baseline
  3. full fine-tuning only if both PEFT options stall. (arXiv)
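
For step 1, a LoRA baseline sketch following the PEFT Whisper guide’s target modules. The r and alpha values are common starting points, not values validated for child speech:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

lora = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # attention projections, per the PEFT guide
    lora_dropout=0.05,
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights train
```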

5. One run on all data, or staged?

My answer

Use a staged approach.

Recommended staging

Stage A: child-domain adaptation
Train on the broadest matched child-reading mixture with lightly normalized verbatim labels. The goal is to adapt the model to child acoustics, suppress adult overlap, and stabilize reading-style decoding.

Stage B: hard-case continuation
Continue from the best Stage A checkpoint with oversampling for:

  • numerals
  • proper names
  • rare words
  • repetitions
  • self-corrections
  • deletion-heavy and insertion-heavy reads.

Do not use only hard cases here. Keep replay from the broad child-reading data so the model does not forget the main domain. That replay recommendation is my inference from the multi-stage child-speech results and the mixed fine-tuning evidence.
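
A sketch of the Stage B continuation, assuming Stage A produced a LoRA adapter. The checkpoint path and dataset names are placeholders:

```python
from datasets import load_dataset, interleave_datasets
from peft import PeftModel
from transformers import WhisperForConditionalGeneration

# Resume from the best Stage A adapter rather than starting over.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
model = PeftModel.from_pretrained(base, "checkpoints/stage-a-lora", is_trainable=True)

hard_cases   = load_dataset("our-org/hard-cases", split="train")
broad_replay = load_dataset("our-org/child-reading", split="train")

# Oversample hard cases, but keep broad-domain replay against forgetting.
stage_b = interleave_datasets(
    [hard_cases, broad_replay], probabilities=[0.6, 0.4], seed=0
)
```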

Stage C: inference-time contextual biasing and scoring
Add passage-specific prompting for names and rare words, then align output to the known passage and compute the score in canonical form. (Hugging Face)

Why staged beats one flat run

The clearest public evidence is the Estonian Whisper-large-v3 paper. Their best model came from a four-stage path:

  1. OpenAI multilingual pretraining
  2. fine-tuning on adult Estonian
  3. fine-tuning on school-age child speech
  4. final tuning on the target-domain child corpus

They state that this multi-stage strategy gave the most robust gains, especially for younger age groups.

The mixed fine-tuning paper supports the same overall idea from a different angle: when no perfect in-domain corpus exists, combining partially matched datasets works better than a single-source adaptation.

One caution

Do not assume that prompting Whisper with the full passage text is automatically good. The reading-assessment paper found that direct read-text prompting could hurt, while strange high-error prompts unexpectedly helped, and large-v3 showed more hallucination problems than large-v2 in that setup. So I would use the known passage mainly in alignment and scoring, and use prompts mainly for lexical biasing toward names and rare words. (arXiv)


6. Has anyone fine-tuned Whisper-large-v3 specifically for child speech or reading assessment?

Child speech: yes

The closest public large-v3 child-speech fine-tuning paper is the Estonian study. They fine-tuned Whisper-large-v3 on child speech and found that the best model, using multi-stage adaptation plus more in-domain data, reduced median WER from 52.7% to 21.2%. They also report that the best model learned to suppress overlapping adult prompts better than the base model.

Reading assessment: close, but not exactly your full setup

The closest public reading-assessment result I found uses Whisper plus prompting and a second-pass alignment/LLM pipeline on child read speech. It improved transcription WER from 9.4% to 5.1% and reading-mistake F1 from 0.39 to 0.73. That is very relevant because it shows the strongest gains came from Whisper + passage-aware refinement, not from plain first-pass decoding. (arXiv)

Disfluency-aware reading assessment: yes, adjacent evidence

The Spanish reading-assessment paper is not Whisper-large-v3 fine-tuning, but it is highly relevant because it improves children’s reading assessment by explicitly modeling disfluencies and generating synthetic disfluent text from known reading prompts. That supports your plan to oversample hard cases rather than pretending child oral reading is clean speech.

Bottom line on public precedent

There are good adjacent public examples, but I did not find a public paper that exactly matches your full stack:

  • Whisper-large-v3
  • one-minute child passage reading
  • correct-word-count as the main metric
  • special emphasis on numerals and proper names.

So you are in a real applied gap, not a saturated recipe space. (arXiv)


What I would build for your exact stack

A. Transcript policy

Lock this before scaling data collection.

Use a written policy for:

  • repetitions
  • self-corrections
  • partial words
  • numeral equivalence
  • acceptable name variants
  • adult prompts in the background
  • punctuation/case stripping rules.

My recommended policy:

  • ASR labels: lightly normalized verbatim
  • scorer: canonicalized equivalence
  • product score: passage-aware alignment.

B. Evaluation metric

Do not select checkpoints by global WER alone.

Use:

  • correct-word-count accuracy
  • error-type metrics for insertion, substitution, deletion
  • slice metrics for numerals and names
  • hallucination rate
  • age-band slices
  • overlap/background-speech slices. (arXiv)

C. Inference design

For one-minute passages, I would use:

  • fixed task="transcribe"
  • fixed language if monolingual
  • passage-specific prompt_ids for names and rare words
  • careful long-form settings with condition_on_prev_tokens, temperature fallback, compression_ratio_threshold, logprob_threshold, and no_speech_threshold. The current Whisper docs expose all of those explicitly. (Hugging Face)

I would start with short lexical prompts, not the full passage.
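
Concretely, a decoding sketch reusing processor, model, and prompt_ids from the biasing sketch earlier. The threshold values are the ones shown in the docs; tune them on your own dev slices:

```python
import numpy as np

# Pass the full one-minute waveform so sequential long-form decoding chunks it.
audio = np.zeros(16000 * 60, dtype=np.float32)  # placeholder waveform
inputs = processor(
    audio, sampling_rate=16000, return_tensors="pt",
    truncation=False, padding="longest", return_attention_mask=True,
)

generated = model.generate(
    **inputs,
    task="transcribe",
    language="en",
    prompt_ids=prompt_ids,                       # short lexical prompt only
    condition_on_prev_tokens=False,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule
    compression_ratio_threshold=1.35,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6,
    return_timestamps=True,                      # needed for long-form decoding
)
text = processor.batch_decode(generated, skip_special_tokens=True)[0]
```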

D. Alignment

Do not rely only on raw first-pass tokens for scoring. The strongest reading-assessment result uses alignment-aware post-processing, and the Spanish paper relies on task-aware decoding because the target text is known. That is a strong sign that your main metric should be computed in a dedicated alignment layer. (arXiv)
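
A minimal sketch of that layer using difflib for word-level alignment, plus the canonicalize() helper from the numerals sketch above. A production scorer would likely want a proper edit-distance alignment with error typing:

```python
import difflib

def correct_word_count(passage: str, hypothesis: str) -> int:
    """Count reference words read correctly: align canonicalized hypothesis
    to canonicalized passage; only matched words score, so insertions and
    substitutions earn nothing."""
    ref = canonicalize(passage)      # from the numerals sketch above
    hyp = canonicalize(hypothesis)
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())

# Child skips "a" and reads the numeral as a word: 6 of 7 words correct.
print(correct_word_count("The 3 little pigs built a house",
                         "the three little pigs built house"))  # -> 6
```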


Practical PEFT / Transformers notes for your stack

These are worth getting right early.

  • The current PEFT Whisper guide says to set remove_unused_columns=False and label_names=["labels"] because PeftModel does not expose the same signature as the base model (setup sketch after this list). (Hugging Face)
  • The same guide sets model.config.forced_decoder_ids = None and model.config.suppress_tokens = [] during training, then restores task/language decoding prompts at inference. (Hugging Face)
  • With gradient checkpointing, set use_cache=False during training. Hugging Face discussion threads state that use_cache=True is incompatible with gradient checkpointing because of past_key_values. (Hugging Face Forums)
  • There is a public PEFT issue where Whisper can receive input_ids instead of input_features, causing a failure in training code paths. (GitHub)
  • There are also public issues around prompt_ids interacting badly with repetition_penalty, and around Whisper word timestamps failing with beam search in some Transformers setups. (GitHub)
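
Putting the first three notes into one setup sketch, reusing model from the LoRA sketch above. Hyperparameters are illustrative, not tuned:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="whisper-lv3-child-reading",
    per_device_train_batch_size=8,
    learning_rate=1e-3,                 # the PEFT guide's value; sweep for your data
    num_train_epochs=3,
    fp16=True,
    gradient_checkpointing=True,
    remove_unused_columns=False,        # required with PeftModel
    label_names=["labels"],             # required with PeftModel
)

model.config.use_cache = False          # incompatible with gradient checkpointing
model.config.forced_decoder_ids = None  # per the PEFT guide, during training
model.config.suppress_tokens = []       # restore task/language prompts at inference
```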

For your use case, that means:

  • keep prompting simple at first
  • be careful with repetition penalties
  • avoid overcomplicated timestamp decoding during early model selection. (GitHub)

Direct answers to your six questions

1. Training targets
Use lightly normalized verbatim transcripts as the main fine-tuning target. Keep numeral equivalence and passage matching in a separate canonical scoring layer. Do not train primarily on passage-matching labels.

2. Data needed
Expect meaningful gains from 10–20 hours of matched child reading data, more stable gains around 30–50 hours, and stronger long-tail coverage above that. The public large-v3 child-speech evidence shows real improvement already at 16–28 hours, while the error-preserving work shows what larger matched corpora can buy you.

3. Data mix
Make matched child reading audio the majority. Add some broader child speech, then oversample numerals, names, and rare words, with a small amount of synthetic support. The mixed fine-tuning and disfluency papers support this direction.

4. LoRA or full FT
Start with PEFT, but compare LoRA with adapter-style PEFT if possible. The public child-ASR benchmark says LoRA was worse than full fine-tuning, while adapters became competitive and matched full fine-tuning for Whisper-large-v3 on MyST-test. (arXiv)

5. One run or staged
Use a staged approach: child-domain adaptation first, then hard-case continuation, then inference-time contextual biasing and alignment. That matches the best large-v3 child-speech results and the strongest reading-assessment architectures.

6. Public precedent
Yes, for child speech and adjacent reading-assessment work. No, not a perfect one-to-one match for your exact product shape. The closest public evidence still supports your direction strongly.


My blunt conclusion

For your product, the highest-probability path is:

  • fine-tune Whisper-large-v3 on lightly normalized verbatim child reading
  • use PEFT first
  • bias training heavily toward matched child reading audio
  • stage the adaptation
  • treat numerals as recognition plus normalization
  • treat proper names as training plus contextual biasing
  • compute the final score in a passage-aware alignment layer, not inside the ASR model. (arXiv)

The single most important decision is not LoRA vs full fine-tuning. It is whether you keep fidelity, canonicalization, and scoring as separate layers. The public evidence says you should.
