Is this a known issue? Yes.
Your code is not “broken.” You are hitting predictable behavior for seq2seq (T5/FLAN-T5) generation:
- It often chooses a short, safe completion under greedy decoding.
- Your formatting cues are weaker than they look because T5 tokenization collapses repeated whitespace and newlines.
- Your length control uses max_length, which is frequently the wrong knob for this use case.
Below is a detailed diagnosis and fixes that reliably turn “one sentence” into “3 fully structured test cases.”
What is happening in your output
You asked for 3 test cases. The model output is:
“If the user’s email and password are not valid, an error message should be displayed.”
That is basically the model doing a minimal paraphrase of the requirement. This pattern is widely reported with FLAN-T5 for “generate N items” prompts: it returns one item or one sentence unless you force length and structure. (Hugging Face Forums)
Also, people see related failure modes like “model repeats the prompt/instructions” when the task is structured and the model is not strongly constrained. (Hugging Face Forums)
So the behavior is common, not specific to your machine or installation.
Root causes (the real ones)
1) max_length is the wrong knob for “give me 3 long structured artifacts”
In Transformers, max_new_tokens is recommended for controlling how many tokens the model generates. max_length exists mainly for backward compatibility and can be confusing. (Hugging Face)
Key idea:
max_length bounds total sequence length (prompt + generated) for many generation flows.
max_new_tokens bounds generated tokens only (ignores prompt length). (Hugging Face Forums)
If your prompt is long, max_length=200 can leave little budget for output. Even when it does not strictly “truncate,” it biases toward short endings.
Fix: switch to max_new_tokens.
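To see why, here is a toy arithmetic sketch of the budget problem. The numbers are hypothetical, and this models the decoder-only case the docs describe, where max_length counts prompt plus output together:

```python
# Illustration of the output-budget problem (numbers are hypothetical).
# With max_length, the prompt eats into the total budget; with
# max_new_tokens, the generation budget is fixed regardless of prompt size.

def output_budget_max_length(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for generation when max_length bounds prompt + output."""
    return max(0, max_length - prompt_tokens)

def output_budget_max_new_tokens(prompt_tokens: int, max_new_tokens: int) -> int:
    """Tokens available for generation when the knob counts output only."""
    return max_new_tokens

# A ~150-token prompt under max_length=200 leaves only ~50 tokens of output:
print(output_budget_max_length(150, 200))      # 50
print(output_budget_max_new_tokens(150, 350))  # 350
```

Three structured test cases easily need a few hundred tokens, so a 50-token budget practically guarantees a one-sentence answer.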
2) Greedy decoding tends to pick the shortest high-probability answer
You are using:
do_sample=False and default num_beams=1 → greedy decoding.
Greedy decoding often returns:
- one sentence
- one bullet
- the most “obvious” clause
Transformers explicitly documents that decoding strategy strongly affects output quality and length. (Hugging Face)
Fix: use beam search (num_beams > 1) and optionally a length penalty.
3) The model is allowed to stop early (EOS) and it will
Generation ends when the model emits an EOS token. That is normal.
To prevent “one sentence then stop,” you must force minimum output length with min_new_tokens (or min_length). These are first-class generation parameters in Transformers. (Hugging Face)
Fix: set min_new_tokens.
4) Your “Format:” block is not as strong as it looks because T5 collapses newlines
T5 tokenization is based on SentencePiece and multiple newlines / repeated whitespace often get normalized. This is discussed directly in HF threads and issues. (Hugging Face Forums)
Practical consequence:
- The model does not “see” your nicely separated template the way you see it.
- So it does not feel compelled to emit all headers.
Fix: use explicit separators that survive tokenization, like ### TestCase 1 ###, FIELDS:, ID=, TITLE=, or emit JSON.
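You can get an intuition for why blank-line templates are fragile with a toy stand-in for whitespace normalization. This is not the actual SentencePiece code, just a simplified model of the effect: runs of whitespace collapse, so blank-line structure disappears while explicit markers survive:

```python
import re

def collapse_whitespace(text: str) -> str:
    """Toy stand-in for SentencePiece-style whitespace normalization:
    runs of spaces/newlines become a single space."""
    return re.sub(r"\s+", " ", text).strip()

template_blank_lines = "Title:\n\nSteps:\n\nExpected Result:\n\n"
template_markers = "### TEST_CASE 1 ### TITLE= STEPS= EXPECTED="

print(collapse_whitespace(template_blank_lines))
# 'Title: Steps: Expected Result:'  -> the blank-line structure is gone
print(collapse_whitespace(template_markers))
# unchanged -> the markers survive normalization intact
```

The marker characters themselves carry the structure, so the model still sees them even after normalization.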
Immediate fix (keep FLAN-T5, but generate correctly)
A. Parameter fix (minimal change)
Use max_new_tokens and force a minimum length. Also use beam search.
from transformers import pipeline

generator = pipeline(
    task="text2text-generation",
    model="google/flan-t5-large",
)

prompt = (
    "You are a QA engineer.\n"
    "Generate EXACTLY 3 test cases in the format below.\n"
    "Use explicit separators.\n\n"
    "REQUIREMENT:\n"
    "- The system should allow a user to log in using a valid email and password.\n"
    "- If the credentials are invalid, an error message should be displayed.\n\n"
    "OUTPUT FORMAT (repeat 3 times):\n"
    "### TEST_CASE <1..3> ###\n"
    "Test Case ID: TC-LOGIN-<1..3>\n"
    "Title:\n"
    "Preconditions:\n"
    "Steps:\n"
    "Expected Result:\n"
)

result = generator(
    prompt,
    max_new_tokens=350,   # recommended knob for output length
    min_new_tokens=200,   # prevents "one sentence then stop"
    num_beams=4,          # less "shortest answer wins"
    do_sample=False,
    length_penalty=1.1,   # mildly favors longer completions
    early_stopping=False,
)

print(result[0]["generated_text"])
Why this works:
- You give the model enough output budget (max_new_tokens). (Hugging Face)
- You force it not to stop immediately (min_new_tokens). (Hugging Face)
- You reduce greedy “one-liner” tendency via beam search. (Hugging Face)
- You avoid relying on blank lines as structure because T5 often collapses them. (Hugging Face Forums)
B. Prompt fix (make structure “tokenizer-proof”)
Avoid “Format:” followed by blank lines as your main structure signal. Replace with:
- ### ... ### markers
- KEY=VALUE style fields
- JSON
Because repeated whitespace/newlines are not stable cues for T5 tokenization. (Hugging Face Forums)
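Explicit markers also make the output machine-checkable. A minimal parser sketch for the template above (the regex patterns are assumptions matching that exact format; adapt them if you change the field names):

```python
import re

def parse_test_cases(text: str) -> list[dict]:
    """Split generated text on the '### TEST_CASE n ###' markers and
    pull out the labelled fields from each block (single-line values)."""
    blocks = re.split(r"### TEST_CASE \d+ ###", text)[1:]  # drop any preamble
    fields = ["Test Case ID", "Title", "Preconditions", "Steps", "Expected Result"]
    cases = []
    for block in blocks:
        case = {}
        for field in fields:
            m = re.search(rf"{re.escape(field)}:\s*(.*)", block)
            case[field] = m.group(1).strip() if m else ""
        cases.append(case)
    return cases

sample = (
    "### TEST_CASE 1 ###\n"
    "Test Case ID: TC-LOGIN-1\n"
    "Title: Valid login\n"
    "Preconditions: User account exists\n"
    "Steps: Enter valid email and password; click Login\n"
    "Expected Result: User is logged in\n"
)
cases = parse_test_cases(sample)
print(len(cases), cases[0]["Test Case ID"])  # 1 TC-LOGIN-1
```

If len(cases) != 3 or a field comes back empty, you know the model drifted and can retry, instead of shipping malformed test cases downstream.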
Production-grade fix: enforce structure (stop depending on the model)
If you need this to work reliably across many requirements, you should constrain output.
Modern stacks do “guided decoding” or “structured outputs”:
- force JSON schema
- force regex
- force a fixed set of fields
vLLM documents this feature and explicitly supports backends like Outlines and lm-format-enforcer. (vLLM)
Why it matters: the schema is enforced at decoding time (tokens that would violate it are masked out), so correct formatting is guaranteed by construction rather than hoped for via prompting.
Which model is “best” for test case generation?
Key reality
“Generate QA test cases from text requirements” is closer to:
- instruction following
- structured long-form output
- sometimes reasoning about edge cases
That is usually better served by decoder-only instruct models than by T5-style seq2seq models, especially if you want consistent formatting.
Good open models (practical picks)
Pick based on your hardware.
If you are CPU-only and want something small
- Qwen2.5-0.5B-Instruct: instruction-tuned, very small, large context for its size. (Hugging Face)
- Phi-3.5-mini-instruct (and its ONNX variant): designed as a lightweight high-quality small model, long context. (Hugging Face)
These will usually follow “generate 3 items with fields” better than FLAN-T5, but you still get best results with schema constraints.
If you have a GPU or can run quantized (4-bit) models
- Mistral-7B-Instruct-v0.3: strong instruction-following; common baseline for structured generation tasks. (Hugging Face)
- Llama-3.1-8B-Instruct: strong general instruct model family. (Hugging Face)
Why FLAN-T5 is not ideal here (even though it’s “instruction tuned”)
FLAN-T5 is instruction-finetuned over a mixture of tasks. (Hugging Face)
But:
- it still often returns short answers for “generate N items”
- its tokenizer behavior weakens template formatting signals (Hugging Face Forums)
- you must do more parameter/prompt engineering to get stable structured output (Hugging Face Forums)
Recommended path (most reliable to least)
- Keep FLAN-T5 but switch to max_new_tokens, set min_new_tokens, use num_beams, and use explicit separators. (Hugging Face)
- Switch model to a small instruct causal LM (Qwen2.5-0.5B-Instruct or Phi-3.5-mini-instruct) if you want better instruction adherence. (Hugging Face)
- Enforce schema (guided decoding / structured outputs) if you need near-100% formatting correctness. (vLLM)
Short summary
- Use max_new_tokens, not max_length. (Hugging Face)
- Force length with min_new_tokens and reduce greedy behavior with num_beams. (Hugging Face)
- Do not rely on blank lines for T5 formatting. Newlines collapse. (Hugging Face Forums)
- If you can switch models, prefer instruct causal LMs (Qwen2.5, Phi-3.5, Mistral, Llama). (Hugging Face)
- For production, enforce structure with guided decoding. (vLLM)