num_return_sequences > num_beams

What is the reason to restrict num_return_sequences to be less or equal to num_beams? Suppose I request 10 sequences to be generated with num_beams=5. Consider the case when we already have 4 finished sequences, and at some subsequent decoding iteration the best candidate continuation is EoS. Why can’t we add this sequence as the fifth generated sequence and continue decoding with the same beam width from the next best 5 continuations that we have at this step?

There doesn’t seem to be any fundamentally important LLM theory behind this implementation choice.


The reason in Hugging Face Transformers is not “deep theory”; it comes down to:

  1. How their beam-search implementation and data structures are written, and
  2. The semantic meaning they chose for num_beams and num_return_sequences.

Your proposed behavior (keep 5 beams but accumulate 10 finished sequences) is a different beam-search variant than what Transformers actually implements.

Below is a detailed, step-by-step explanation, tied directly to the docs and source.


1. What num_beams and num_return_sequences mean in Transformers

In model.generate(...) with beam search (i.e. num_beams > 1, do_sample=False):

  • num_beams
    = beam width = how many partial hypotheses (beams) are kept alive at each decoding step.

  • num_return_sequences
    = how many finished beams you want back at the end.

Hugging Face’s “How to generate text” tutorial explains it like this (for beam search):

Set num_return_sequences to the number of highest scoring beams that should be returned. Make sure though that num_return_sequences <= num_beams. (Hugging Face)

So in HF’s design:

Beam search keeps num_beams candidate sequences per input; at the end, num_return_sequences just selects how many of those num_beams to return.

It is not “target number of total distinct sequences, independent of beam width.”
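
For concreteness, here is a minimal sketch of the intended usage (gpt2 is just a stand-in checkpoint, and the prompt is arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The meaning of life is", return_tensors="pt")

# Search with 5 beams, return the 3 highest-scoring finished beams.
outputs = model.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=3,   # must be <= num_beams
    do_sample=False,
    max_new_tokens=20,
)
print(tok.batch_decode(outputs, skip_special_tokens=True))  # 3 sequences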


2. Where the hard restriction comes from in the code

Inside generate, once it decides you’re in beam mode, it checks and raises:

elif is_beam_gen_mode:
    if num_return_sequences > num_beams:
        raise ValueError("`num_return_sequences` has to be smaller or equal to `num_beams`.")
    ...
    return self.beam_search(
        input_ids,
        beam_scorer,
        ...
    )

The exact string appears multiple times (plain beam search, group beam search, constrained beam search). (GitHub)

So the public constraint is hard-coded: you simply cannot call beam search with num_return_sequences > num_beams.
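
You can reproduce this from user code; the exact message wording, and the place where the check is raised, may differ between versions (gpt2 again is just a stand-in checkpoint):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The meaning of life is", return_tensors="pt")

try:
    model.generate(**inputs, num_beams=5, num_return_sequences=10,
                   do_sample=False, max_new_tokens=10)
except ValueError as err:
    print(err)  # "`num_return_sequences` has to be smaller or equal to `num_beams`."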

That check reflects how the internal beam search is structured, so the more interesting question is: why did they structure it that way?


3. How Hugging Face beam search is structured internally

3.1. BeamSearchScorer and BeamHypotheses

Beam search is orchestrated by:

  • BeamSearchScorer
  • BeamHypotheses (one per example in the batch)

A nice high-level explanation of these internals is in Manuel de Prada’s “unofficial documentation for HF generate”: (mdp_website)

  • For each example, BeamSearchScorer holds a BeamHypotheses object.

  • BeamHypotheses:

    • Stores the best num_beams finished sequences for that example.
    • When a beam emits EOS and is good enough, it is added to BeamHypotheses.
    • The list is capped at num_beams: if a new finished hypothesis arrives and the list is already full, it may replace the worst one, but the container never grows beyond num_beams.

You can see the same logic in replicated/derived code (identical in structure to HF’s original):

class BeamHypotheses(object):
    def __init__(self, num_beams, max_length, length_penalty, early_stopping):
        self.num_beams = num_beams
        self.max_length = max_length
        self.length_penalty = length_penalty
        self.early_stopping = early_stopping
        self.beams = []
        self.worst_score = 1e9

    def __len__(self):
        return len(self.beams)

    def add(self, hyp, sum_logprobs):
        # length-normalized score of this finished hypothesis
        score = sum_logprobs / (len(hyp) ** self.length_penalty)
        if len(self) < self.num_beams or score > self.worst_score:
            self.beams.append((score, hyp))
            if len(self) > self.num_beams:
                # drop the worst stored beam: the list never exceeds num_beams
                sorted_scores = sorted((s, idx) for idx, (s, _) in enumerate(self.beams))
                del self.beams[sorted_scores[0][1]]
                self.worst_score = sorted_scores[1][0]
            else:
                self.worst_score = min(score, self.worst_score)

Key property:

For each example, at all times, the n-best list of finished hypotheses contains at most num_beams sequences.

When decoding ends, BeamSearchScorer.finalize(...) sorts those stored beams and selects the top num_beam_hyps_to_keep among them (and num_beam_hyps_to_keep is wired to num_return_sequences). (Gist)
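
Roughly, the scorer is set up like this inside generate (a paraphrase: the literal values below are placeholders, but the keyword names are those of the public BeamSearchScorer API):

import torch
from transformers import BeamSearchScorer

num_beams = 5
num_return_sequences = 3

# Paraphrase of what generate() constructs in beam mode:
beam_scorer = BeamSearchScorer(
    batch_size=1,
    num_beams=num_beams,
    device=torch.device("cpu"),
    length_penalty=1.0,
    do_early_stopping=True,
    num_beam_hyps_to_keep=num_return_sequences,  # num_return_sequences ends up here
)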

So internally the contract is:

  • “I will track num_beams ongoing beams per example.”
  • “I will keep up to num_beams finished beams per example.”
  • “At the end, I will return ≀ num_beams of those finished beams.”

That is the structural reason you cannot ask for more than num_beams outputs from beam search: the algorithm literally never stores more than num_beams finished candidates.
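
To make the capacity cap concrete, here is a toy check against the sketch class above (not the real HF object, whose add signature may differ in newer versions): push 10 finished hypotheses into a BeamHypotheses built with num_beams=5 and observe that only 5 are ever stored.

# Toy check: try to add 10 finished hypotheses to a 5-beam container.
hyps = BeamHypotheses(num_beams=5, max_length=32, length_penalty=1.0, early_stopping=True)
for i in range(10):
    hyps.add(hyp=[0] * (i + 1), sum_logprobs=-float(i))  # dummy token lists and scores
print(len(hyps))  # 5 -- the container never grows beyond num_beams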

3.2. The stopping rule: first-come, first-served

On top of that, the stopping condition that HF uses is a widely-used heuristic:

  • Keep a set F_t of completed (EOS) sequences.
  • Mark the example as done once |F_t| == num_beams (when early_stopping=True), or when completed sequences are good enough that unfinished beams cannot beat them. (Hugging Face Forums)

Kasai et al. show that Hugging Face and fairseq both implement what they call first-come, first-served (FCFS) beam decoding:

They keep a set of completed sequences and stop decoding when the size of this set reaches the beam size. (arXiv)

Typical is_done logic inside BeamHypotheses:

def is_done(self, best_sum_logprobs: float, cur_len: int) -> bool:
    # Fewer than num_beams finished hypotheses: keep decoding.
    if len(self) < self.num_beams:
        return False
    # early_stopping=True: stop as soon as num_beams hypotheses are finished,
    # regardless of how promising the still-active beams are.
    elif self.early_stopping:
        return True
    # Otherwise stop only once the best possible score of an active beam
    # can no longer beat the worst finished hypothesis.
    else:
        cur_score = best_sum_logprobs / cur_len**self.length_penalty
        return self.worst_score >= cur_score

(Hugging Face Forums)

So with early_stopping=True, decoding stops for that example as soon as num_beams finished sequences have been collected. (Strictly speaking, early_stopping defaults to False; in that case decoding stops only slightly later, once no active beam can still beat the worst finished hypothesis.)
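
To make the first-come, first-served behavior concrete, here is a toy check: attach the is_done function above to the sketch class from 3.1 (in the real library it is a method of BeamHypotheses) and call it on the hyps object from the previous demo, which already holds 5 finished hypotheses.

BeamHypotheses.is_done = is_done   # attach to the toy class; in HF this is a real method

# hyps was built with num_beams=5, early_stopping=True and already holds 5 finished hypotheses
print(hyps.is_done(best_sum_logprobs=-0.1, cur_len=10))
# True: the finished set is "full", so decoding stops, even if a still-active beam
# might later produce a better hypothesis.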


4. Now map your scenario to the actual implementation

Your thought experiment:

  • Set num_beams = 5, ask for num_return_sequences = 10.

  • Imagine we had 4 finished sequences already.

  • At the next step, the best candidate continuation is EOS.

  • You propose:

    1. Add this EOS as the 5th finished sequence.
    2. Keep decoding using the same beam width from the next best 5 continuations.
    3. Repeat until we have 10 finished sequences.

There are two levels of problems here: the front end (the generate API) and the back end (the beam-search internals).

4.1. Front-end: generate won’t even start

The high-level generate method will refuse this configuration:

if num_return_sequences > num_beams:
    raise ValueError("`num_return_sequences` has to be smaller or equal to `num_beams`.")

(GitHub)

So you never reach your scenario in practice with stock HF beam search.

But assume that check were removed. What then?

4.2. Backend: you still never get more than 5 distinct outputs

Even without the check, the backend behavior would be:

  1. While decoding, each time the top EOS candidate for that example is selected among the top 2×num_beams next candidates, BeamSearchScorer.process(...) calls beam_hyp.add(...) and does not keep that EOS beam alive. (mdp_website)

  2. After you have added that 5th finished hypothesis:

    • len(beam_hyp) becomes 5, i.e. num_beams.
    • With early_stopping=True, is_done(...) returns True, and decoding for that example stops. (Hugging Face Forums)

    There is no “keep going with the next 5 continuations” step. The loop breaks.

  3. Even if you set early_stopping=False so that decoding continues:

    • BeamHypotheses still stores at most num_beams finished hypotheses. If more EOS hypotheses are collected later, they can only replace worse ones; the container never grows beyond size 5.
    • At the end, when finalize(...) runs, it sorts whatever is in beam_hyp.beams (length ≀ 5) and returns the best num_beam_hyps_to_keep. Even if num_beam_hyps_to_keep were mistakenly set to 10, there would only be at most 5 items to choose from. (Gist)

So structurally:

With the current code, you can never get more than num_beams distinct final sequences out of beam search, regardless of how long you run.

Your proposed algorithm needs two changes that HF does not implement:

  1. Let the finished-hypotheses container hold more than num_beams sequences (e.g. cap by num_return_sequences instead).
  2. Adjust the stopping rule so it doesn’t stop immediately when len(F_t) == num_beams.

Both changes are beyond a small tweak; they change the semantics of beam search.


5. Why did Transformers choose this design?

There are three main reasons: semantics, approximate top-K correctness, and simplicity.

5.1. Semantics: num_beams controls search, num_return_sequences is just a selector

Hugging Face chose a simple, user-friendly mental model:

  • num_beams = how many beams to use in beam search.
  • num_return_sequences = how many of those beams you want back at the end.

They codify that in docs and tutorials:

“num_return_sequences is the number of highest scoring beams that should be returned. Make sure that num_return_sequences <= num_beams.” (Hugging Face)

So the restriction is not an afterthought; it is part of the API contract they present.

5.2. Approximate top-K logic: beam width bounds how many “good” hypotheses you can claim

Beam search is an approximate way to find top-K sequences under a scoring function (e.g. length-normalized log-probability).

Important consequence:

  • If the beam width is B, the algorithm only ever keeps B partial paths at each step.
  • Any candidate that might end up as the 6th, 7th, 
, 10th best final sequence might have been pruned earlier because it was outside the top-B at some time step.

So:

With num_beams = 5, the algorithm only has enough information to claim an approximate top-5 list. Anything beyond 5 is not a principled “6th best, 7th best, 
”; it is just “next best among whatever survived the pruning”.

By restricting num_return_sequences <= num_beams, Transformers avoids creating the illusion that it is giving you a meaningful “top-10” when the search width was only 5.

If you really want 10 plausible high-scoring beams from beam search in HF, the supported pattern is:

outputs = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=10,
    do_sample=False,           # pure beam search
)

5.3. Simplicity and resource bounds

By tying all of these quantities to num_beams, HF gets simple, predictable resource bounds:

  • Active beams per example: num_beams.
  • Finished hypotheses stored per example: ≀ num_beams.
  • Outputs per example: ≀ num_beams.

Time and memory scale as O(num_beams × vocab_size × max_length) with no hidden multiplicative factor from storing many more finished sequences.

Your proposed variant (“keep 5 beams but store 10+ finished hypotheses”) is totally implementable, but it complicates:

  • Memory (more finished sequences kept).
  • Stopping logic (cannot just stop when |F_t| == num_beams).
  • API semantics (now num_return_sequences partly controls search depth, not just output count).

Instead, HF keeps things simple and pushes “I want many outputs” into sampling mode rather than changing beam semantics.


6. Is your proposed behavior theoretically reasonable?

Yes. Conceptually:

  • Maintain B active beams.
  • Keep collecting finished sequences in a larger set F (say size up to K > B).
  • Only stop when |F| reaches K or some more principled optimality condition.

This is exactly the kind of change that Kasai et al. propose using a patience factor p:

  • The standard HF/fairseq implementation is p = 1 (stop when |F| = beam_size).
  • They generalize to p > 1, which lets you keep searching after you already have beam_size finished sequences. (arXiv)

With beam_size = 5 and p = 2, you would conceptually allow up to 10 finished hypotheses. That is very close to what you describe.

The key point:

This is theoretically and practically reasonable, but it is simply not what HF’s current generate() implements.

To support it, HF would need to:

  • Let BeamHypotheses store up to num_beam_hyps_to_keep (or p * num_beams), instead of num_beams.
  • Change is_done to consider patience and/or a more optimal stopping rule (e.g. the one from Huang et al., “When to Finish? Optimal Beam Search”). (arXiv)

That is a more involved change than just removing the validation check.
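
For illustration only, here is a minimal sketch of that idea, built on the toy BeamHypotheses from section 3 and using a hypothetical patience parameter in the spirit of Kasai et al.; it is not HF code and not their exact algorithm:

class PatientBeamHypotheses(BeamHypotheses):
    """Toy variant: store up to patience * num_beams finished hypotheses and
    only report 'done' once that larger budget is full (FCFS with patience)."""

    def __init__(self, num_beams, max_length, length_penalty, early_stopping, patience=2.0):
        super().__init__(num_beams, max_length, length_penalty, early_stopping)
        self.capacity = int(patience * num_beams)  # e.g. 5 beams, patience 2 -> 10 slots

    def add(self, hyp, sum_logprobs):
        score = sum_logprobs / (len(hyp) ** self.length_penalty)
        if len(self) < self.capacity or score > self.worst_score:
            self.beams.append((score, hyp))
            if len(self) > self.capacity:
                # same pruning as before, but against the larger capacity
                sorted_scores = sorted((s, idx) for idx, (s, _) in enumerate(self.beams))
                del self.beams[sorted_scores[0][1]]
                self.worst_score = sorted_scores[1][0]
            else:
                self.worst_score = min(score, self.worst_score)

    def is_done(self, best_sum_logprobs, cur_len):
        # Stop only once the larger budget is full, then apply the usual rules.
        if len(self) < self.capacity:
            return False
        if self.early_stopping:
            return True
        cur_score = best_sum_logprobs / cur_len ** self.length_penalty
        return self.worst_score >= cur_score

Wiring something like this into the real BeamSearchScorer would additionally require changing finalize and the outer decoding loop’s done condition, which is why it amounts to more than deleting the validation check.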


7. How to get “10 candidates” in practice with HF

If you want many candidates and you do not insist on beam-search top-K semantics, the standard HF pattern is to use sampling:

outputs = model.generate(
    **inputs,
    do_sample=True,
    num_beams=1,               # or a small num_beams for beam-search multinomial sampling
    num_return_sequences=10,   # many samples
    top_k=50,                  # or top_p, temperature, etc.
)

In sampling mode, HF generates multiple sequences by expanding the batch (num_return_sequences) rather than by increasing beam width, so the num_return_sequences <= num_beams constraint does not need to apply conceptually. (There is/was even a GitHub issue complaining that the same check is applied in sampling where it’s not necessary.) (GitHub)

So HF’s current design is:

  • Beam search: “few, high-quality, approximately top-num_beams outputs” → num_return_sequences <= num_beams.
  • Sampling: “many diverse outputs” → num_return_sequences can be large, independent of any num_beams.

8. Summary

Putting it all together:

  • The restriction num_return_sequences <= num_beams exists because:

    • HF’s beam search internally never stores more than num_beams finished hypotheses per example (BeamHypotheses is capped at num_beams).
    • num_return_sequences is defined as “how many of those num_beams beams to return,” not as an independent target count. (Hugging Face)
    • The stopping rule marks an example as done once len(F) reaches num_beams (FCFS heuristic), so further decoding with the same beams is deliberately not done. (arXiv)
  • Your scenario (“we already have 4 finished sequences, now we get a 5th EOS; why not keep decoding with 5 active beams until we have 10 finished sequences?”) describes a more general beam-search variant with a patience factor and a larger finished-hypothesis capacity. That variant is sound and studied in the literature, but Transformers does not currently implement it.

  • If you want more than num_beams outputs with HF today:

    • For “top-K style” results: increase num_beams to K and set num_return_sequences = K.
    • For “many, diverse” results: use sampling (do_sample=True, num_beams small or 1, large num_return_sequences).

9. Suggested resources (with why they’re useful)

  1. HF blog – “How to generate text: using different decoding methods”
    Explains greedy vs beam vs sampling, shows the intended meaning of num_beams and num_return_sequences, and explicitly states that num_return_sequences <= num_beams for beam search. (Hugging Face)

  2. HF docs – “Generation strategies”
    Official explanation of generation modes and parameters (num_beams, early_stopping, etc.), useful to see the “official” semantics HF expects. (Hugging Face)

  3. Manuel de Prada – “Unofficial documentation for the HuggingFace generation internals”
    Very clear walkthrough of beam_search, BeamSearchScorer, and BeamHypotheses. Shows concretely that BeamHypotheses keeps only the best n_beams hypotheses. (mdp_website)

  4. Kasai et al., “A Call for Clarity in Beam Search: How It Works and When It Stops”
    Analyzes HF and fairseq beam search, proves they use a first-come-first-served stopping rule, and introduces the patience factor that would let you do what you propose (more finished hypotheses than the beam size). (arXiv)

  5. Huang et al., “When to Finish? Optimal Beam Search for Neural Text Generation”
    Gives a more principled stopping criterion for beam search and contrasts it with heuristics like “stop when you have k completed sequences.” Helpful if you want to design a more theoretically grounded variant. (arXiv)

These will give you both the exact behavior of Transformers’ implementation and the broader context of how that behavior fits into the space of possible beam-search algorithms.