What is the reason to restrict num_return_sequences to be less than or equal to num_beams? Suppose I request 10 sequences to be generated with num_beams=5. Consider the case when we already have 4 finished sequences, and at some subsequent decoding iteration the best candidate continuation is EoS. Why can't we add this sequence as the fifth generated sequence and continue decoding with the same beam width from the next best 5 continuations that we have at this step?
It doesn't seem like there is any fundamentally important LLM background that led to this implementation.
The reason in Hugging Face Transformers is not "deep theory"; it is:
- How their beam-search implementation and data structures are written, and
- The semantic meaning they chose for `num_beams` and `num_return_sequences`.
Your proposed behavior (keep 5 beams but accumulate 10 finished sequences) is a different beam-search variant than what Transformers actually implements.
Below is a detailed, step-by-step explanation, tied directly to the docs and source.
1. What num_beams and num_return_sequences mean in Transformers
In model.generate(...) with beam search (i.e. num_beams > 1, do_sample=False):
- `num_beams` = beam width = how many partial hypotheses (beams) are kept alive at each decoding step.
- `num_return_sequences` = how many finished beams you want back at the end.
Hugging Face's "How to generate text" tutorial explains it like this (for beam search):
Set `num_return_sequences` to the number of highest scoring beams that should be returned. Make sure though that `num_return_sequences <= num_beams`. (Hugging Face)
So in HF's design:
Beam search keeps `num_beams` candidate sequences per input; at the end, `num_return_sequences` just selects how many of those `num_beams` to return.
It is not "a target number of total distinct sequences, independent of beam width."
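As a concrete illustration of these semantics (a minimal sketch; the `gpt2` checkpoint and the prompt are arbitrary placeholders, not something the docs prescribe), asking for 3 of 5 beams back looks like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder model; any causal LM behaves the same way here
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The restriction exists because", return_tensors="pt")

# beam search with 5 beams; return the 3 highest-scoring finished beams
outputs = model.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=3,  # must be <= num_beams
    do_sample=False,
    max_new_tokens=20,
)
print(outputs.shape[0])  # 3 sequences come back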
2. Where the hard restriction comes from in the code
Inside `generate`, once it decides you are in beam mode, it literally checks and raises:
elif is_beam_gen_mode:
    if num_return_sequences > num_beams:
        raise ValueError("`num_return_sequences` has to be smaller or equal to `num_beams`.")
    ...
    return self.beam_search(
        input_ids,
        beam_scorer,
        ...
    )
The exact string appears multiple times (plain beam search, group beam search, constrained beam search). (GitHub)
So the public constraint is hard-coded: you simply cannot call beam search with num_return_sequences > num_beams.
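For instance (a sketch building on the snippet in section 1; the exact error message may vary across transformers versions), asking for 10 sequences from 5 beams fails before any decoding starts:

# assumes `model` and `inputs` from the earlier snippet
try:
    model.generate(
        **inputs,
        num_beams=5,
        num_return_sequences=10,  # > num_beams
        do_sample=False,
        max_new_tokens=20,
    )
except ValueError as err:
    print(err)  # complains that num_return_sequences has to be <= num_beams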
That check reflects how the internal beam search is structured, so the more interesting question is: why did they structure it that way?
3. How Hugging Face beam search is structured internally
3.1. BeamSearchScorer and BeamHypotheses
Beam search is orchestrated by:
- `BeamSearchScorer`
- `BeamHypotheses` (one per example in the batch)
A nice high-level explanation of these internals is in Manuel de Prada's "unofficial documentation for HF generate": (mdp_website)
- For each example, `BeamSearchScorer` holds a `BeamHypotheses` object.
- `BeamHypotheses`:
  - Stores the best `num_beams` finished sequences for that example.
  - When a beam emits EOS and is good enough, it is added to `BeamHypotheses`.
  - The list is capped at `num_beams`: if a new finished hypothesis arrives and the list is already full, it may replace the worst one, but the container never grows beyond `num_beams`.
You can see the same logic in replicated/derived code (identical in structure to HF's original):
class BeamHypotheses(object):
    def __init__(self, num_beams, max_length, length_penalty, early_stopping):
        self.num_beams = num_beams
        self.max_length = max_length
        self.length_penalty = length_penalty
        self.early_stopping = early_stopping
        self.beams = []
        self.worst_score = 1e9

    def __len__(self):
        return len(self.beams)

    def add(self, hyp, sum_logprobs):
        # length-normalized score of the finished hypothesis
        score = sum_logprobs / (len(hyp) ** self.length_penalty)
        if len(self) < self.num_beams or score > self.worst_score:
            self.beams.append((score, hyp))
            if len(self) > self.num_beams:
                # drop the worst beam so the list never exceeds num_beams
                sorted_scores = sorted((s, i) for i, (s, _) in enumerate(self.beams))
                del self.beams[sorted_scores[0][1]]
                self.worst_score = sorted_scores[1][0]
            else:
                self.worst_score = min(score, self.worst_score)
Key property:
For each example, at all times, the n-best list of finished hypotheses contains at most `num_beams` sequences.
When decoding ends, BeamSearchScorer.finalize(...) sorts those stored beams and selects the top num_beam_hyps_to_keep among them (and num_beam_hyps_to_keep is wired to num_return_sequences). (Gist)
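A minimal sketch of that selection step (an illustrative helper; `select_outputs` is invented here and is not the real `finalize`, which also handles padding, batching, and returned scores):

def select_outputs(beam_hyp, num_beam_hyps_to_keep):
    # beam_hyp.beams holds at most num_beams (score, hyp) pairs;
    # sort them by score and keep the top num_beam_hyps_to_keep
    best_first = sorted(beam_hyp.beams, key=lambda pair: pair[0], reverse=True)
    return [hyp for _, hyp in best_first[:num_beam_hyps_to_keep]]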
So internally the contract is:
- "I will track `num_beams` ongoing beams per example."
- "I will keep up to `num_beams` finished beams per example."
- "At the end, I will return ≤ `num_beams` of those finished beams."
That is the structural reason you cannot ask for more than num_beams outputs from beam search: the algorithm literally never stores more than num_beams finished candidates.
3.2. The stopping rule: first-come, first-served
On top of that, the stopping condition that HF uses is a widely-used heuristic:
- Keep a set `F_t` of completed (EOS) sequences.
- Mark the example as done once `|F_t| == num_beams` (when `early_stopping=True`), or when the completed sequences are good enough that unfinished beams cannot beat them. (Hugging Face Forums)
Kasai et al. show that Hugging Face and fairseq both implement what they call first-come, first-served (FCFS) beam decoding:
They keep a set of completed sequences and stop decoding when the size of this set reaches the beam size. (arXiv)
Typical is_done logic inside BeamHypotheses:
def is_done(self, best_sum_logprobs: float, cur_len: int) -> bool:
    if len(self) < self.num_beams:
        return False
    elif self.early_stopping:
        return True
    else:
        cur_score = best_sum_logprobs / cur_len**self.length_penalty
        return self.worst_score >= cur_score
So with early_stopping=True, as soon as you have num_beams finished sequences, decoding stops for that example.
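To make that concrete, here is a toy run (assuming the `is_done` method above is attached to the `BeamHypotheses` sketch from section 3.1; the token lists and scores are dummies):

# exercise the FCFS stopping rule with 5 dummy finished hypotheses
# (assumes is_done from above is defined on the BeamHypotheses class)
bh = BeamHypotheses(num_beams=5, max_length=20, length_penalty=1.0, early_stopping=True)
for i in range(5):
    bh.add([0, 1, 2, i], sum_logprobs=-float(i))

print(len(bh.beams))                                  # 5: the container is full
print(bh.is_done(best_sum_logprobs=-0.1, cur_len=4))  # True: decoding stops for this example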
4. Now map your scenario to the actual implementation
Your thought experiment:
- Set `num_beams = 5`, ask for `num_return_sequences = 10`.
- Imagine we had 4 finished sequences already.
- At the next step, the best candidate continuation is EOS.
- You propose:
  - Add this EOS as the 5th finished sequence.
  - Keep decoding using the same beam width from the next best 5 continuations.
  - Repeat until we have 10 finished sequences.
There are two levels of problems here: front-end and backend.
4.1. Front-end: generate won't even start
The high-level generate method will refuse this configuration:
if num_return_sequences > num_beams:
    raise ValueError("`num_return_sequences` has to be smaller or equal to `num_beams`.")
(GitHub)
So you never reach your scenario in practice with stock HF beam search.
But assume that check were removed. What then?
4.2. Backend: you still never get more than 5 distinct outputs
Even without the check, the backend behavior would be:
- While decoding, each time the top EOS candidate for that example is selected among the top 2 × `num_beams` next candidates, `BeamSearchScorer.process(...)` calls `beam_hyp.add(...)` and does not keep that EOS beam alive. (mdp_website)
- After you have added that 5th finished hypothesis:
  - `len(beam_hyp)` becomes 5, i.e. `num_beams`.
  - With `early_stopping=True`, `is_done(...)` returns `True`, and decoding for that example stops. (Hugging Face Forums)
  - There is no "keep going with the next 5 continuations" step. The loop breaks.
- Even if you set `early_stopping=False` so that decoding continues:
  - `BeamHypotheses` still stores at most `num_beams` finished hypotheses. If more EOS hypotheses are collected later, they can only replace worse ones; the container never grows beyond size 5.
  - At the end, when `finalize(...)` runs, it sorts whatever is in `beam_hyp.beams` (length ≤ 5) and returns the best `num_beam_hyps_to_keep`. Even if `num_beam_hyps_to_keep` were mistakenly set to 10, there would only be at most 5 items to choose from. (Gist)
So structurally:
With the current code, you can never get more than `num_beams` distinct final sequences out of beam search, regardless of how long you run.
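You can see that cap directly with the `BeamHypotheses` sketch from section 3.1 (dummy hypotheses again): no matter how many finished hypotheses you feed it, it never holds more than 5:

# try to store 10 finished hypotheses in a container built for num_beams=5
# (uses the BeamHypotheses sketch from section 3.1)
bh = BeamHypotheses(num_beams=5, max_length=20, length_penalty=1.0, early_stopping=False)
for i in range(10):
    bh.add([0, 1, 2, 3], sum_logprobs=-float(i))

print(len(bh.beams))  # 5: once full, worse hypotheses are rejected (or replace the worst one)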
Your proposed algorithm needs two changes that HF does not implement:
- Let the finished-hypotheses container hold more than `num_beams` sequences (e.g. cap it by `num_return_sequences` instead).
- Adjust the stopping rule so it doesn't stop immediately when `len(F_t) == num_beams`.
Both changes are beyond a small tweak; they change the semantics of beam search.
5. Why did Transformers choose this design?
There are three main reasons: semantics, approximate top-K correctness, and simplicity.
5.1. Semantics: num_beams controls search, num_return_sequences is just a selector
Hugging Face chose a simple, user-friendly mental model:
- `num_beams` = how many beams to use in beam search.
- `num_return_sequences` = how many of those beams you want back at the end.
They codify that in docs and tutorials:
"`num_return_sequences` is the number of highest scoring beams that should be returned. Make sure that `num_return_sequences <= num_beams`." (Hugging Face)
So the restriction is not an afterthought; it is part of the API contract they present.
5.2. Approximate top-K logic: beam width bounds how many "good" hypotheses you can claim
Beam search is an approximate way to find top-K sequences under a scoring function (e.g. length-normalized log-probability).
Important consequence:
- If the beam width is B, the algorithm only ever keeps B partial paths at each step.
- Any candidate that might have ended up as the 6th, 7th, ..., 10th best final sequence may have been pruned earlier because it fell outside the top B at some time step.
So:
With `num_beams = 5`, the algorithm only has enough information to claim an approximate top-5 list. Anything beyond 5 is not a principled "6th best, 7th best, ..." list; it is just "next best among whatever survived the pruning".
By restricting num_return_sequences <= num_beams, Transformers avoids creating the illusion that it is giving you a meaningful âtop-10â when the search width was only 5.
If you really want 10 plausible high-scoring beams from beam search in HF, the supported pattern is:
outputs = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=10,
    do_sample=False,  # pure beam search
)
5.3. Simplicity and resource bounds
By tying all of these quantities to `num_beams`, HF gets simple, predictable complexity:
- Active beams per example: `num_beams`.
- Finished hypotheses stored per example: ≤ `num_beams`.
- Outputs per example: ≤ `num_beams`.
Time and memory scale as O(num_beams × vocab_size × max_length) with no hidden multiplicative factor from storing many more finished sequences.
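As a rough back-of-the-envelope illustration (the 50k vocabulary size is just an assumed figure):

# per decoding step and per example, beam search scores num_beams * vocab_size continuations
num_beams, vocab_size, max_length = 5, 50_000, 100
scores_per_step = num_beams * vocab_size               # 250,000 candidate scores per step
scores_per_generation = scores_per_step * max_length   # 25,000,000 over a full generation
print(scores_per_step, scores_per_generation)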
Your proposed variant ("keep 5 beams but store 10+ finished hypotheses") is totally implementable, but it complicates:
- Memory (more finished sequences kept).
- Stopping logic (cannot just stop when |F_t| == num_beams).
- API semantics (now `num_return_sequences` partly controls search depth, not just output count).
Instead, HF keeps things simple and pushes "I want many outputs" into sampling mode rather than changing beam semantics.
6. Is your proposed behavior theoretically reasonable?
Yes. Conceptually:
- Maintain B active beams.
- Keep collecting finished sequences in a larger set F (say size up to K > B).
- Only stop when |F| reaches K, or when some more principled optimality condition is met.
This is exactly the kind of change that Kasai et al. propose using a patience factor p:
- The standard HF/fairseq implementation is `p = 1` (stop when |F| = beam_size).
- They generalize to `p > 1`, which lets you keep searching after you already have beam_size finished sequences. (arXiv)
With beam_size = 5 and p = 2, you would conceptually allow up to 10 finished hypotheses. That is very close to what you describe.
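A minimal sketch of that variant (not HF code; the class name `PatientBeamHypotheses` is invented here, and it simply combines a larger finished-hypothesis container with a patience-aware stopping rule in the spirit of Kasai et al.):

class PatientBeamHypotheses(object):
    """Sketch: stores up to patience * num_beams finished hypotheses instead of num_beams."""

    def __init__(self, num_beams, length_penalty, early_stopping, patience=1.0):
        self.capacity = int(patience * num_beams)  # e.g. 5 beams with patience 2 -> 10 slots
        self.length_penalty = length_penalty
        self.early_stopping = early_stopping
        self.beams = []
        self.worst_score = 1e9

    def add(self, hyp, sum_logprobs):
        score = sum_logprobs / (len(hyp) ** self.length_penalty)
        if len(self.beams) < self.capacity or score > self.worst_score:
            self.beams.append((score, hyp))
            if len(self.beams) > self.capacity:
                # keep the container at `capacity` by dropping the worst stored hypothesis
                sorted_scores = sorted((s, i) for i, (s, _) in enumerate(self.beams))
                del self.beams[sorted_scores[0][1]]
                self.worst_score = sorted_scores[1][0]
            else:
                self.worst_score = min(score, self.worst_score)

    def is_done(self, best_sum_logprobs, cur_len):
        # stop only once `capacity` (= patience * num_beams) hypotheses are finished
        if len(self.beams) < self.capacity:
            return False
        if self.early_stopping:
            return True
        cur_score = best_sum_logprobs / cur_len ** self.length_penalty
        return self.worst_score >= cur_score

With num_beams = 5 and patience = 2.0, this container stops only after 10 finished hypotheses, which matches the behavior you describe.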
The key point:
This is theoretically and practically reasonable, but it is simply not what HF's current `generate()` implements.
To support it, HF would need to:
- Let `BeamHypotheses` store up to `num_beam_hyps_to_keep` (or `p * num_beams`) hypotheses, instead of `num_beams`.
- Change `is_done` to take patience into account and/or use a more principled stopping rule (e.g. the one from Huang et al., "When to Finish? Optimal Beam Search"). (arXiv)
That is a more involved change than just removing the validation check.
7. How to get "10 candidates" in practice with HF
If you want many candidates and you do not insist on beam-search top-K semantics, the standard HF pattern is to use sampling:
outputs = model.generate(
    **inputs,
    do_sample=True,
    num_beams=1,              # or low beam if you really want
    num_return_sequences=10,  # many samples
    top_k=50,                 # or top_p, temperature, etc.
)
In sampling mode, HF generates multiple sequences by expanding the batch (num_return_sequences) rather than by increasing beam width, so the num_return_sequences <= num_beams constraint does not need to apply conceptually. (There is/was even a GitHub issue complaining that the same check is applied in sampling where it's not necessary.) (GitHub)
So HF's current design is:
- Beam search: "few, high-quality, approximately top-`num_beams` outputs" → `num_return_sequences <= num_beams`.
- Sampling: "many diverse outputs" → `num_return_sequences` can be large, independent of any `num_beams`.
8. Summary
Putting it all together:
- The restriction `num_return_sequences <= num_beams` exists because:
  - HF's beam search internally never stores more than `num_beams` finished hypotheses per example (`BeamHypotheses` is capped at `num_beams`).
  - `num_return_sequences` is defined as "how many of those `num_beams` beams to return," not as an independent target count. (Hugging Face)
  - The stopping rule marks an example as done once `len(F)` reaches `num_beams` (FCFS heuristic), so further decoding with the same beams is deliberately not done. (arXiv)
- Your scenario ("we already have 4 finished sequences, now we get a 5th EOS; why not keep decoding with 5 active beams until we have 10 finished sequences?") describes a more general beam-search variant with a patience factor and a larger finished-hypothesis capacity. That variant is sound and studied in the literature, but Transformers does not currently implement it.
- If you want more than `num_beams` outputs with HF today:
  - For "top-K style" results: increase `num_beams` to K and set `num_return_sequences = K`.
  - For "many, diverse" results: use sampling (`do_sample=True`, `num_beams` small or 1, large `num_return_sequences`).
9. Suggested resources (with why they're useful)
- HF blog – "How to generate text: using different decoding methods"
  Explains greedy vs beam vs sampling, shows the intended meaning of `num_beams` and `num_return_sequences`, and explicitly states that `num_return_sequences <= num_beams` for beam search. (Hugging Face)
- HF docs – "Generation strategies"
  Official explanation of generation modes and parameters (`num_beams`, `early_stopping`, etc.), useful to see the "official" semantics HF expects. (Hugging Face)
- Manuel de Prada – "Unofficial documentation for the HuggingFace generation internals"
  Very clear walkthrough of `beam_search`, `BeamSearchScorer`, and `BeamHypotheses`. Shows concretely that `BeamHypotheses` keeps only the best `n_beams` hypotheses. (mdp_website)
- Kasai et al., "A Call for Clarity in Beam Search: How It Works and When It Stops"
  Analyzes HF and fairseq beam search, proves they use a first-come-first-served stopping rule, and introduces the patience factor that would let you do what you propose (more finished hypotheses than the beam size). (arXiv)
- Huang et al., "When to Finish? Optimal Beam Search for Neural Text Generation"
  Gives a more principled stopping criterion for beam search and contrasts it with heuristics like "stop when you have k completed sequences." Helpful if you want to design a more theoretically grounded variant. (arXiv)
These will give you both the exact behavior of Transformersâ implementation and the broader context of how that behavior fits into the space of possible beam-search algorithms.