Task="text2text-generation" and model="google/flan-t5-(base or large)" fails to generate testcases from description

The aim is to generate QA test cases from a description. I used the text2text-generation pipeline with FLAN-T5, but it fails to generate: it mostly copies or paraphrases the prompt as output. Which model is best suited for test-case generation?

from transformers import pipeline

# Correct pipeline for T5
generator = pipeline(
    task="text2text-generation",
    model="google/flan-t5-large"
)

prompt = """
Generate 3 test cases for the following requirement.

Requirement:
The system should allow a user to log in using a valid email and password.
If the credentials are invalid, an error message should be displayed. Please generate test cases in below format

Format:
Test Case ID:
Title:
Preconditions:
Steps:
Expected Result:
"""

result = generator(
    prompt,
    max_length=200,
    do_sample=False
)

print("\nGenerated Test Cases:\n")
print(result[0]["generated_text"])

Output:

(venv) siva@localhost:~/app/share/vscode/NewHuggingFacePOC$python hf_test.py
Device set to use cpu

Generated Test Cases:

If the user's email and password are not valid, an error message should be displayed.
(venv) siva@localhost:~/app/share/vscode/NewHuggingFacePOC$

Requirements installed:

transformers
torch
requests
python-dotenv
accelerate
sentencepiece
Python 3.11.8 (also tried 3.12)

hi,
This isn’t a Python / Transformers install problem. Your generation is being hard-capped by max_length, and in many generation flows that cap counts prompt + output tokens, so you’re starving the model of room to actually emit 3 full test cases. Use max_new_tokens (which budgets output only) instead.

Here’s the same approach, fixed:
from transformers import pipeline

generator = pipeline(
    task="text2text-generation",
    model="google/flan-t5-large",
)

result = generator(
    prompt,                    # same prompt as in the question
    max_new_tokens=350,        # output budget (this is the key)
    num_beams=4,               # helps with structured outputs
    do_sample=False,
    no_repeat_ngram_size=3,
    repetition_penalty=1.1,
)

print(result[0]["generated_text"])

If it still “summarises” instead of following your template, don’t fight it blindly: give it one example test case in the prompt (few-shot). FLAN-T5 often behaves much better with a single concrete example than with pure instructions.
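For instance, a one-shot version of the prompt could look like this (the example test case below is illustrative, not from the thread):

```python
# One-shot prompt: show FLAN-T5 a single worked example before the real task.
example = (
    "Test Case ID: TC-001\n"
    "Title: Successful login with valid credentials\n"
    "Preconditions: A registered user account exists.\n"
    "Steps: 1. Open the login page. 2. Enter a valid email and password. 3. Click Login.\n"
    "Expected Result: The user is logged in and redirected to the dashboard.\n"
)

prompt = (
    "Generate 3 test cases for the requirement below, "
    "using exactly the same format as the example.\n\n"
    "Example:\n" + example + "\n"
    "Requirement:\n"
    "The system should allow a user to log in using a valid email and password. "
    "If the credentials are invalid, an error message should be displayed.\n"
)

print(prompt)
```

Then pass this `prompt` to the generator exactly as before.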

Which model is best for test-case generation?

Bluntly: encoder–decoder T5 models are OK, but modern instruction-tuned chat LLMs usually do this task better, especially for rigid templates.

My practical picks on Hugging Face right now:

  • Qwen2.5 Instruct family (good instruction-following; choose size based on your hardware).
  • If you specifically want “testing / code adjacent” behaviour, CodeT5+ is a strong code-oriented option, but it’s more naturally aimed at code tasks than formatted QA test descriptions.

If you want to try Qwen quickly (chat/instruct style), use text-generation and the model’s chat template:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "Qwen/Qwen2.5-7B-Instruct"  # smaller: Qwen/Qwen2.5-1.5B-Instruct
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

gen = pipeline("text-generation", model=model, tokenizer=tok)

messages = [
    {"role": "system", "content": "You generate software QA test cases in the exact requested template."},
    {"role": "user", "content": prompt},
]

text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = gen(text, max_new_tokens=450, do_sample=False)
print(out[0]["generated_text"])

The key takeaway: switch max_length → max_new_tokens for FLAN-T5, and if you want consistently structured multi-test-case output, use an instruct chat model (Qwen2.5 / Llama / Mistral class) with its chat template.

hope this helps, Liam


Seems to be a known issue?


Your code is not “broken.” You are hitting predictable behavior for seq2seq (T5/FLAN-T5) generation:

  • It often chooses a short, safe completion under greedy decoding.
  • Your formatting cues are weaker than they look because T5 tokenization collapses repeated whitespace and newlines.
  • Your length control uses max_length, which is frequently the wrong knob for this use case.

Below is a detailed diagnosis and fixes that reliably turn “one sentence” into “3 fully structured test cases.”


What is happening in your output

You asked for 3 test cases. The model output is:

“If the user’s email and password are not valid, an error message should be displayed.”

That is basically the model doing a minimal paraphrase of the requirement. This pattern is widely reported with FLAN-T5 for “generate N items” prompts: it returns one item or one sentence unless you force length and structure. (Hugging Face Forums)

Also, people see related failure modes like “model repeats the prompt/instructions” when the task is structured and the model is not strongly constrained. (Hugging Face Forums)

So the behavior is common, not specific to your machine or installation.


Root causes (the real ones)

1) max_length is the wrong knob for “give me 3 long structured artifacts”

In Transformers, max_new_tokens is recommended for controlling how many tokens the model generates. max_length exists mainly for backward compatibility and can be confusing. (Hugging Face)

Key idea:

  • max_length bounds total sequence length (prompt + generated) for many generation flows.
  • max_new_tokens bounds generated tokens only (ignores prompt length). (Hugging Face Forums)

If your prompt is long, max_length=200 can leave little budget for output. Even when it does not strictly “truncate,” it biases toward short endings.

Fix: switch to max_new_tokens.
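To see the difference concretely, compare the two knobs on a hypothetical 150-token prompt (the numbers are illustrative):

```python
# max_length caps prompt + generated tokens in many generation flows,
# while max_new_tokens caps generated tokens only.
prompt_tokens = 150      # illustrative prompt length
max_length = 200
max_new_tokens = 350

budget_with_max_length = max_length - prompt_tokens   # only 50 tokens left for output
budget_with_max_new_tokens = max_new_tokens           # full 350-token output budget

print(budget_with_max_length)       # 50
print(budget_with_max_new_tokens)   # 350
```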


2) Greedy decoding tends to pick the shortest high-probability answer

You are using:

  • do_sample=False and default num_beams=1 → greedy decoding.

Greedy decoding often returns:

  • one sentence
  • one bullet
  • the most “obvious” clause

Transformers explicitly documents that decoding strategy strongly affects output quality and length. (Hugging Face)

Fix: use beam search (num_beams > 1) and optionally a length penalty.


3) The model is allowed to stop early (EOS) and it will

Generation ends when the model emits an EOS token. That is normal.

To prevent “one sentence then stop,” you must force minimum output length with min_new_tokens (or min_length). These are first-class generation parameters in Transformers. (Hugging Face)

Fix: set min_new_tokens.


4) Your “Format:” block is not as strong as it looks because T5 collapses newlines

T5 tokenization is based on SentencePiece and multiple newlines / repeated whitespace often get normalized. This is discussed directly in HF threads and issues. (Hugging Face Forums)

Practical consequence:

  • The model does not “see” your nicely separated template the way you see it.
  • So it does not feel compelled to emit all headers.

Fix: use explicit separators that survive tokenization, like ### TestCase 1 ###, FIELDS:, ID=, TITLE=, or emit JSON.
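You can approximate the effect with a whitespace-collapsing pass. This is a rough illustration of what normalization does to your template, not the actual SentencePiece implementation:

```python
import re

template = """Format:

Test Case ID:

Title:
"""

# Roughly mimic repeated-whitespace normalization: the blank-line
# structure the human sees disappears once whitespace is collapsed.
collapsed = re.sub(r"\s+", " ", template).strip()
print(collapsed)   # Format: Test Case ID: Title:
```

Separators like `### TEST_CASE 1 ###` survive this kind of collapsing; blank lines do not.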


Immediate fix (keep FLAN-T5, but generate correctly)

A. Parameter fix (minimal change)

Use max_new_tokens and force a minimum length. Also use beam search.

from transformers import pipeline

generator = pipeline(
    task="text2text-generation",
    model="google/flan-t5-large",
)

prompt = (
    "You are a QA engineer.\n"
    "Generate EXACTLY 3 test cases in the format below.\n"
    "Use explicit separators.\n\n"
    "REQUIREMENT:\n"
    "- The system should allow a user to log in using a valid email and password.\n"
    "- If the credentials are invalid, an error message should be displayed.\n\n"
    "OUTPUT FORMAT (repeat 3 times):\n"
    "### TEST_CASE <1..3> ###\n"
    "Test Case ID: TC-LOGIN-<1..3>\n"
    "Title:\n"
    "Preconditions:\n"
    "Steps:\n"
    "Expected Result:\n"
)

result = generator(
    prompt,
    max_new_tokens=350,      # recommended knob
    min_new_tokens=200,      # prevents “one sentence then stop”
    num_beams=4,             # less “shortest answer wins”
    do_sample=False,
    length_penalty=1.1,      # mildly favors longer completions
    early_stopping=False,
)

print(result[0]["generated_text"])

Why this works:

  • You give the model enough output budget (max_new_tokens). (Hugging Face)
  • You force it not to stop immediately (min_new_tokens). (Hugging Face)
  • You reduce greedy “one-liner” tendency via beam search. (Hugging Face)
  • You avoid relying on blank lines as structure because T5 often collapses them. (Hugging Face Forums)
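A side benefit of the ### TEST_CASE markers: the output becomes trivially parseable. A sketch, assuming the model followed the template (`parse_test_cases` is a hypothetical helper, not a library API):

```python
def parse_test_cases(text):
    """Split generated text on '### TEST_CASE ... ###' markers into dicts."""
    fields = ["Test Case ID", "Title", "Preconditions", "Steps", "Expected Result"]
    cases = []
    # Everything before the first marker is ignored.
    for chunk in text.split("### TEST_CASE")[1:]:
        case = {}
        for line in chunk.splitlines():
            for field in fields:
                if line.strip().startswith(field + ":"):
                    case[field] = line.split(":", 1)[1].strip()
        cases.append(case)
    return cases

sample = (
    "### TEST_CASE 1 ###\n"
    "Test Case ID: TC-LOGIN-1\n"
    "Title: Valid login\n"
    "Preconditions: User exists\n"
    "Steps: Enter valid credentials\n"
    "Expected Result: User is logged in\n"
)
print(parse_test_cases(sample))
```

This also gives you a cheap sanity check: if `len(parse_test_cases(output)) != 3`, regenerate.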

B. Prompt fix (make structure “tokenizer-proof”)

Avoid “Format:” followed by blank lines as your main structure signal. Replace with:

  • ### ... ### markers
  • KEY=VALUE style fields
  • JSON

Because repeated whitespace/newlines are not stable cues for T5 tokenization. (Hugging Face Forums)


Production-grade fix: enforce structure (stop depending on the model)

If you need this to work reliably across many requirements, you should constrain output.

Modern stacks do “guided decoding” or “structured outputs”:

  • force JSON schema
  • force regex
  • force a fixed set of fields

vLLM documents this feature and explicitly supports backends like Outlines and lm-format-enforcer. (vLLM)

Why it matters:

  • Prompting alone will always occasionally produce:

    • missing fields
    • 2 test cases instead of 3
    • merged steps
    • random prose
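If you cannot adopt guided decoding, a cheap fallback is a post-hoc validator that rejects malformed output so your pipeline can retry. A sketch with synthetic strings (`validate_output` is a hypothetical helper, not a library API):

```python
REQUIRED_FIELDS = ["Test Case ID:", "Title:", "Preconditions:", "Steps:", "Expected Result:"]

def validate_output(text, expected_cases=3):
    """True only if the text has the expected number of test-case markers
    and every required field appears once per test case."""
    if text.count("### TEST_CASE") != expected_cases:
        return False
    return all(text.count(field) == expected_cases for field in REQUIRED_FIELDS)

# Synthetic well-formed output: the same block repeated 3 times.
good = (
    "### TEST_CASE 1 ###\n"
    "Test Case ID: ...\nTitle: ...\nPreconditions: ...\n"
    "Steps: ...\nExpected Result: ...\n"
) * 3
bad = good.replace("Expected Result: ...\n", "", 1)   # one field missing

print(validate_output(good))   # True
print(validate_output(bad))    # False
```

Validate-and-retry is less reliable than constrained decoding, but it needs no extra infrastructure.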

Which model is “best” for test case generation?

Key reality

“Generate QA test cases from text requirements” is closer to:

  • instruction following
  • structured long-form output
  • sometimes reasoning about edge cases

That is usually better served by decoder-only instruct models than by T5-style seq2seq models, especially if you want consistent formatting.

Good open models (practical picks)

Pick based on your hardware.

If you are CPU-only and want something small

  • Qwen2.5-0.5B-Instruct: instruction-tuned, very small, large context for its size. (Hugging Face)
  • Phi-3.5-mini-instruct (and its ONNX variant): designed as a lightweight high-quality small model, long context. (Hugging Face)

These will usually follow “generate 3 items with fields” better than FLAN-T5, but you still get best results with schema constraints.

If you have a GPU or can run quantized (4-bit) models

  • Mistral-7B-Instruct-v0.3: strong instruction-following; common baseline for structured generation tasks. (Hugging Face)
  • Llama-3.1-8B-Instruct: strong general instruct model family. (Hugging Face)

Why FLAN-T5 is not ideal here (even though it’s “instruction tuned”)

FLAN-T5 is instruction-finetuned over a mixture of tasks. (Hugging Face)
But:

  • it still often returns short answers for “generate N items”
  • its tokenizer behavior weakens template formatting signals (Hugging Face Forums)
  • you must do more parameter/prompt engineering to get stable structured output (Hugging Face Forums)


Recommended path (most reliable to least)

  1. Keep FLAN-T5 but switch to max_new_tokens, set min_new_tokens, use num_beams, and use explicit separators. (Hugging Face)
  2. Switch model to a small instruct causal LM (Qwen2.5-0.5B-Instruct or Phi-3.5-mini-instruct) if you want better instruction adherence. (Hugging Face)
  3. Enforce schema (guided decoding / structured outputs) if you need near-100% formatting correctness. (vLLM)

Short summary

  • Use max_new_tokens, not max_length. (Hugging Face)
  • Force length with min_new_tokens and reduce greedy behavior with num_beams. (Hugging Face)
  • Do not rely on blank lines for T5 formatting. Newlines collapse. (Hugging Face Forums)
  • If you can switch models, prefer instruct causal LMs (Qwen2.5, Phi-3.5, Mistral, Llama). (Hugging Face)
  • For production, enforce structure with guided decoding. (vLLM)