Trouble fine-tuning Flan-T5 (with LoRA) for structured map generation – model repeats prompt or instructions

I’m currently working on a student project where I’m trying to fine-tune a Flan-T5 Small model (using LoRA) to generate structured game maps based on creative prompts. I followed several tutorials and wanted to ask for advice because I can’t get the model to learn anything meaningful.
The idea is to transform an imaginative text prompt into a structured output (like a spatial map layout).

Example prompt:
A calm sandy beach with palm trees and old fishing boats

THEME: [beach, tropical, boats]
SIZE: (27x12)
PATTERN: striped_vertical

ZONE: Sea [position=left]
CELL TYPE: water
FEATURES: fishing_boat
ENEMIES: None

ZONE: Beach [position=right]
CELL TYPE: sand
FEATURES: palm_tree, crate, campfire
ENEMIES: None

RULES:
features_density=moderate
loot_density=normal
enemy_density=none

Typical instruction:

You are an AI level designer for a fantasy adventure game.

Your task is to transform creative environment prompts into structured spatial maps.

Each map must include:

  • A rich list of THEMES capturing the essence of the place
  • A SIZE in the format (height x width)
  • A PATTERN describing the layout: maze, radial, organic, etc.
  • One or more ZONE blocks, each with:
    • A unique ID
    • A POSITION (e.g., top-left, center, bottom-right)
    • CELL TYPES (terrain types like lava, rock, grass)
    • FEATURES (visuals or interactive elements)
    • ENEMIES (monsters or traps)
  • A RULES section describing element density and level characteristics

Make the output imaginative but always follow the format.


PROMPT:
An overgrown jungle ruin where ancient machines sleep under moss and roots, and wildlife has reclaimed the place.

OUTPUT:

Setup

Model: Flan-T5 Small
LoRA Config:
from peft import LoraConfig

lora_config = LoraConfig(
    task_type="SEQ_2_SEQ_LM",
    r=4,
    lora_alpha=32,
    lora_dropout=0.01,
    target_modules=["q"],
)
Trainable params: 86,016 / Total params: ~77M (~0.11%)

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-3,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="no",
)

Batch size: 4
Dataset size: 36 examples (very small – but I was hoping for at least some signs of overfitting)
No special tokens used (except padding)

Problems

The model often:

  • Repeats instruction phrases (e.g., “A PATTERN describing the layout:…”)
  • Outputs incomplete or empty content
  • Copies the input prompt or behaves as if it’s ignoring the structure

Hypotheses:

Is cross-entropy not well suited for creative generation tasks like this?

Is the model too small or not suitable for learning structured generation?

Is 36 examples simply too few, even for LoRA?

Would switching to a stricter JSON-based format help guide the model better?

I’d be really grateful for any advice, even rough suggestions or examples of similar cases. Thanks a lot in advance!


When training an LLM, I think 36 examples is simply not enough…

Augmenting the data seems like the right move. Even if you switch models, larger models generally benefit from more training data, so I’d prioritize increasing the dataset size first.


by HuggingChat:

The user is facing challenges fine-tuning Flan-T5 Small with LoRA for structured game map generation. Below are actionable suggestions to address the key issues:


1. Dataset Size & Quality

  • Problem: Training on 36 examples is insufficient for structured generation, even with LoRA. The model lacks exposure to diverse patterns and structural variations.
  • Solutions:
    • Generate synthetic data: Augment the dataset to 500–1,000+ examples using rule-based templates or a larger pre-trained model (e.g., GPT-3.5/4) to create synthetic prompts and structured outputs (a template sketch follows this list).
    • Add noise/diversity: Vary themes, sizes, patterns, and zone configurations (e.g., randomize positions, features, enemies) to improve generalization.
    • Use few-shot prompting: If synthetic data is not feasible, test zero/few-shot prompting with a larger model (e.g., Flan-T5 XL/XXL) before fine-tuning.
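A minimal sketch of the rule-based route; every pool, helper name, and the exact output format below are illustrative assumptions, so swap in your real vocabularies and target format:

# Rule-based augmentation sketch -- pools and names are made up.
import random

THEME_POOL = [["beach", "tropical", "boats"], ["jungle", "ruins", "wildlife"]]
PATTERNS = ["striped_vertical", "maze", "radial", "organic"]
CELL_TYPES = ["water", "sand", "grass", "stone", "lava"]
FEATURES = ["palm_tree", "crate", "campfire", "moss", "vine", "ancient_gear"]
POSITIONS = ["left", "right", "center", "top-left", "bottom-right"]

def make_example(rng: random.Random) -> dict:
    """Build one synthetic (prompt, target) pair from the pools above."""
    themes = rng.choice(THEME_POOL)
    zones = []
    for name in rng.sample(["Sea", "Beach", "Ruin", "Forest"], rng.randint(1, 3)):
        zones.append(
            f"ZONE: {name} [position={rng.choice(POSITIONS)}]\n"
            f"CELL TYPE: {rng.choice(CELL_TYPES)}\n"
            f"FEATURES: {', '.join(rng.sample(FEATURES, 2))}\n"
            f"ENEMIES: None"
        )
    target = (
        f"THEME: [{', '.join(themes)}]\n"
        f"SIZE: ({rng.randint(10, 40)}x{rng.randint(10, 40)})\n"
        f"PATTERN: {rng.choice(PATTERNS)}\n\n" + "\n\n".join(zones)
    )
    prompt = f"A place with {' and '.join(themes)}"  # or have a larger LLM phrase it
    return {"input": prompt, "target": target}

rng = random.Random(0)
synthetic = [make_example(rng) for _ in range(500)]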

2. Model Capacity & Architecture

  • Problem: Flan-T5 Small (77M parameters) may lack capacity to learn complex structured outputs.
  • Solutions:
    • Upgrade to Flan-T5 Base (248M) or Large (780M) for better performance (a loading sketch follows this list).
    • Switch to a code/text hybrid model: Try CodeT5 (T5 fine-tuned for code), as structured map generation resembles code-like syntax.
    • Use a decoder-only model: Consider GPT-2 or Llama-3 (smaller variants) with LoRA, as they may better handle structured, hierarchical outputs.
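For reference, swapping checkpoints and attaching LoRA uses the standard PEFT API; the module names "q"/"v" match T5's attention projections and would differ for a decoder-only model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "google/flan-t5-base"  # or "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

lora_config = LoraConfig(task_type="SEQ_2_SEQ_LM", r=8, lora_alpha=64,
                         lora_dropout=0.1, target_modules=["q", "v"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check the trainable fraction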

3. Structured Output Format

  • Problem: The model repeats instructions or outputs incomplete content due to poor structural constraints.
  • Solutions:
    • Enforce strict JSON formatting:
      {
        "THEMES": ["jungle", "ancient", "wildlife"],
        "SIZE": "27x12",
        "PATTERN": "organic",
        "ZONES": [
          {
            "ID": "ruin",
            "POSITION": "center",
            "CELL_TYPE": "stone",
            "FEATURES": ["moss", "vine", "ancient_gear"],
            "ENEMIES": ["jungle_snake", "spider_swarm"]
          },
          {
            "ID": "forest",
            "POSITION": "top-left",
            "CELL_TYPE": "grass",
            "FEATURES": ["tree", "bush"],
            "ENEMIES": ["none"]
          }
        ],
        "RULES": {
          "features_density": "high",
          "loot_density": "sparse",
          "enemy_density": "moderate"
        }
      }
      
    • Add special tokens: Use BOS/EOS tokens and delimiters (e.g., <THEME>, <ZONE_START>) to scaffold the output.
    • Post-processing: Use regex or a parser to enforce format validity during inference (both ideas are sketched below).
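A short sketch of both; the tag names are placeholders, and the required keys in the validator are just the ones from the JSON example above:

import json
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Register structural delimiters so the tokenizer never splits them.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<THEME>", "<ZONE_START>", "<ZONE_END>"]}
)
model.resize_token_embeddings(len(tokenizer))  # grow embeddings to match

def is_valid_map(text: str) -> bool:
    """Post-hoc format check: reject outputs that are not parseable JSON."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return all(k in obj for k in ("THEMES", "SIZE", "PATTERN", "ZONES", "RULES"))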

4. Training Configuration

  • Problem: Default hyperparameters (e.g., learning rate, epochs) may not suit LoRA fine-tuning for structured tasks.
  • Solutions:
    • Adjust learning rate: 1e-3 may be too aggressive here; try 5e-4 or 1e-4 and compare.
    • Increase training duration: Train for 10–20+ epochs and monitor overfitting (use early stopping if possible).
    • Add gradient clipping: Prevent instability with max_grad_norm=1.0.
    • Check teacher forcing: Hugging Face's seq2seq trainers apply teacher forcing by default; verify your labels are the tokenized targets (with padding masked to -100) so it actually takes effect. A sketch combining these settings follows this list.
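Combined, those settings look like this with the stock Trainer API (train_ds/eval_ds are placeholders for your tokenized splits; on recent transformers versions evaluation_strategy is spelled eval_strategy):

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    num_train_epochs=20,
    weight_decay=0.01,
    max_grad_norm=1.0,                  # gradient clipping
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,                        # the LoRA-wrapped model
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()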

5. Loss Function & Evaluation

  • Problem: Cross-entropy loss may not penalize structural errors (e.g., missing zones).
  • Solutions:
    • Custom loss weighting: Assign higher weights to critical structural tokens (e.g., zone headers, keys like CELL_TYPE); a sketch follows this list.
    • Metrics: Track:
      • Exact match for required fields (e.g., presence of THEMES, ZONES).
      • BLEU/ROUGE for text quality.
      • Syntax validity (e.g., valid JSON, proper field nesting).
    • Human evaluation: Assess creativity, coherence, and adherence to the prompt.
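One way to realize the weighting is to subclass Seq2SeqTrainer and override compute_loss. The keyword list and the 2.0 weight below are illustrative assumptions, and tokenizer is assumed to be the one loaded earlier:

import torch
import torch.nn.functional as F
from transformers import Seq2SeqTrainer

# Collect the ids of every piece that spells a structural keyword.
STRUCT_IDS = set()
for word in ["THEME", "SIZE", "PATTERN", "ZONE", "RULES"]:
    STRUCT_IDS.update(tokenizer(word, add_special_tokens=False).input_ids)

class WeightedLossTrainer(Seq2SeqTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs["labels"]
        outputs = model(**inputs)
        # Per-token cross entropy; padding positions (-100) contribute 0.
        loss = F.cross_entropy(
            outputs.logits.view(-1, outputs.logits.size(-1)),
            labels.view(-1),
            ignore_index=-100,
            reduction="none",
        )
        flat = labels.view(-1)
        weights = torch.ones_like(loss)
        for tid in STRUCT_IDS:
            weights[flat == tid] = 2.0  # up-weight structural tokens
        loss = (loss * weights).sum() / (flat != -100).sum()
        return (loss, outputs) if return_outputs else loss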

6. Debugging Tips

  • Check attention patterns: Visualize attention maps to confirm the model focuses on relevant input sections.
  • Overfit a single example: Test if the model can memorize a single structured output (debugs data formatting and the training loop); a minimal loop is sketched below.
  • Simplify the task: Train on a subset of the structure (e.g., only THEMES and SIZE) and incrementally add complexity.
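A minimal loop for the single-example check (one_batch is a placeholder for one tokenized example moved to the model's device):

import torch

# Loss should approach zero within a few hundred steps; if it does not,
# inspect the data pipeline rather than the model.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
model.train()
for step in range(200):
    loss = model(**one_batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 20 == 0:
        print(step, round(loss.item(), 4))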

Example Code Adjustments

# LoRA config with more target modules
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    task_type="SEQ_2_SEQ_LM",
    r=8,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q", "v"],  # Target more attention matrices
)

# Training args
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=5e-4,
    per_device_train_batch_size=2,  # Smaller batch size for larger models
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
)
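And a quick inference check once training finishes (the decoding arguments are illustrative; no_repeat_ngram_size in particular damps the verbatim repetition described above):

inputs = tokenizer(
    "An overgrown jungle ruin where ancient machines sleep under moss.",
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256, no_repeat_ngram_size=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))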

Final Thoughts

Structured generation with LLMs is inherently challenging. Prioritize:

  1. Data quantity/diversity.
  2. Strict output formatting.
  3. Model capacity (larger models or code-generation variants).
  4. Hyperparameter tuning.

If these steps fail, consider rule-based systems (e.g., template engines with LLM-guided variation) for production use, while continuing to experiment with fine-tuning.