I need help configuring a YAML file for Z-Image Turbo LoRA training

Hello everyone, how are you?

Does anyone here have a YAML config for training a LoRA on Z-Image Turbo? I need to train a realistic character and would like a well-configured YAML file to get good results.

I’m using ai-toolkit on an A100 GPU in the cloud.

Any help would be greatly appreciated, thank you in advance.


maybe like this?


What’s special about Z-Image Turbo LoRA training (and why YAML differs)

Z-Image Turbo is step-distilled (built to look good in ~8 steps). If you train a LoRA on it “normally,” the distillation can break quickly (“Turbo drift”), and you end up needing more steps / higher CFG to recover quality. (RunComfy)

To address this, Ostris provides a training adapter (“de-distillation” LoRA) you load during training, then remove at inference so your LoRA still runs at distilled (fast) speeds. (Hugging Face)


Recommended baseline for your case (realistic character, A100)

This baseline matches the current “known-good” structure used by AI Toolkit configs and common Z-Image Turbo setups (FlowMatch, ~3000 steps, LR 1e-4, 8-step sampling, guidance 0). (GitHub)

YAML template (Turbo + training adapter, character LoRA, A100-friendly)

Replace paths, dataset name, and trigger_word. Keep everything else unchanged for run #1.

job: extension
config:
  name: "zit_char_realistic_lora_a100"

  process:
    - type: diffusion_trainer

      # Output + bookkeeping
      training_folder: "/workspace/ai-toolkit/output/zit_char_realistic_lora_a100"
      sqlite_db_path: "/workspace/ai-toolkit/aitk_db_zit_char_realistic.db"  # keep per-instance to avoid DB contention
      device: "cuda"
      performance_log_every: 25

      # Trigger token used in prompts/captions to “call” the character
      trigger_word: "zch4r_001"

      # LoRA capacity
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
        conv: 16
        conv_alpha: 16
        network_kwargs:
          ignore_if_contains: []

      # Checkpoint saving
      save:
        dtype: "bf16"
        save_every: 250
        max_step_saves_to_keep: 12
        save_format: "diffusers"
        push_to_hub: false

      # Dataset
      datasets:
        - folder_path: "/workspace/datasets/zit_char_realistic"   # images + optional .txt captions
          mask_path: null
          mask_min_value: 0.1
          caption_ext: "txt"
          default_caption: ""              # leave "" if you provide per-image captions
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          cache_text_embeddings: false     # safer for Z-Image Turbo training right now
          is_reg: false
          network_weight: 1

          # Multi-res buckets recommended for Z-Image LoRAs
          resolution: [512, 768, 1024]

          # Z-Image configs often keep these fields even for 1-frame image training
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          do_i2v: true
          flip_x: false
          flip_y: false

      # Training hyperparameters
      train:
        batch_size: 1
        gradient_accumulation: 2          # effective batch ~2, stable for character learning
        steps: 3000
        lr: 0.0001

        train_unet: true
        train_text_encoder: false

        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        optimizer_params:
          weight_decay: 0.0001

        timestep_type: "weighted"
        content_or_style: "balanced"
        loss_type: "mse"

        unload_text_encoder: false
        dtype: "bf16"

        # Leave these off for baseline stability
        diff_output_preservation: false
        skip_first_sample: false
        force_first_sample: false
        disable_sampling: false
        bypass_guidance_embedding: false

        ema_config:
          use_ema: false
          ema_decay: 0.99

      # Base model + training adapter
      model:
        name_or_path: "Tongyi-MAI/Z-Image-Turbo"
        arch: "zimage:turbo"

        # Key: use the training adapter while training
        assistant_lora_path: "ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v1.safetensors"

        quantize: false
        qtype: "qfloat8"
        quantize_te: false
        qtype_te: "qfloat8"

        low_vram: false
        model_kwargs: {}
        layer_offloading: false
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1

      # Sampling previews during training
      sample:
        sampler: "flowmatch"
        sample_every: 250
        width: 1024
        height: 1024

        # Turbo preview settings (fast + representative)
        sample_steps: 8
        guidance_scale: 0
        seed: 42
        walk_seed: false
        neg: ""

        samples:
          - prompt: "zch4r_001, photorealistic portrait, natural skin texture, 85mm lens, soft window light, neutral background"
          - prompt: "zch4r_001, full body photo, casual street outfit, outdoor city background, golden hour, realistic"
          - prompt: "zch4r_001, candid photo, sitting in a cafe, shallow depth of field, realistic lighting"
          - prompt: "zch4r_001, studio headshot, clean backdrop, sharp focus, photorealistic"

        num_frames: 1
        fps: 1

Why these specific defaults:

  • Steps 2500–3000, LR 1e-4, rank 16, buckets 512/768/1024 are widely used as a baseline for Turbo+adapter LoRAs. (RunComfy)
  • Guidance scale = 0 is the standard preview target for Turbo (guidance-distilled behavior). (RunComfy)
  • Adapter v1 is the “safe baseline.” v2 is worth A/B testing after you get a stable run. (RunComfy)
  • The adapter is explicitly intended for shorter runs (styles/concepts/characters); very long runs can still drift and produce artifacts once you remove the adapter. (Hugging Face)

How to fill in the YAML correctly (field-by-field)

1) config.name

A label for your job. It becomes your output folder name in many workflows.

2) training_folder

Where checkpoints and previews go. Use local NVMe if possible (faster saves, fewer I/O stalls).

3) sqlite_db_path (important in cloud pods)

AI Toolkit’s UI uses SQLite for job tracking. On some cloud/persistent storage setups, SQLite can time out (especially shared filesystems). Keeping the DB on fast local disk and per instance reduces risk. (GitHub)

4) trigger_word

Make it unique (not a normal word). You’ll use it:

  • in per-image captions (recommended), and
  • in your sample prompts.

This is how you “call” the character reliably.
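If you script your dataset prep, a tiny stdlib Python sketch like this guarantees the trigger token appears in every caption (the folder path, the write_default_captions name, and the "person" class word are my placeholders, not AI Toolkit API):

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def write_default_captions(folder: str, trigger: str, class_word: str = "person") -> int:
    """Create a minimal '<trigger> <class_word>' caption for images missing a .txt file.
    Returns how many caption files were written."""
    written = 0
    for img in Path(folder).iterdir():
        if img.suffix.lower() not in IMAGE_EXTS:
            continue
        cap = img.with_suffix(".txt")
        if not cap.exists():
            cap.write_text(f"{trigger} {class_word}, photo")
            written += 1
    return written

if __name__ == "__main__":
    # Demo in a throwaway folder; point it at your real dataset path instead.
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        (Path(d) / "img_001.jpg").touch()
        (Path(d) / "img_002.png").touch()
        print(write_default_captions(d, "zch4r_001"))  # 2
```

Run it once before training; images that already have hand-written .txt captions are left untouched.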

5) network (LoRA capacity)

  • Start with rank 16 (linear: 16) as baseline. (RunComfy)
  • If the character identity is weak after ~3000 steps, increase to linear 32 (and alpha 32) before increasing steps.
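For intuition on what bumping the rank costs: a LoRA pair adds rank × (d_in + d_out) parameters per adapted linear, so doubling the rank doubles the LoRA's size. A rough sketch (the layer dimensions below are illustrative guesses, not Z-Image's real shapes):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # One low-rank pair: A is (rank x d_in), B is (d_out x rank)
    return rank * (d_in + d_out)

# Hypothetical model: 40 adapted linears of width 3072 (illustrative only)
layers = [(3072, 3072)] * 40
for rank in (16, 32):
    total = sum(lora_params(i, o, rank) for i, o in layers)
    mb = total * 2 / 1e6  # bf16 = 2 bytes per parameter
    print(f"rank {rank}: {total:,} params ~ {mb:.0f} MB")
```

The takeaway is linear scaling: rank 32 buys twice the capacity (and file size) of rank 16, which is why it is the first knob to turn when identity is weak.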

6) datasets

Key knobs for character realism:

  • resolution: [512, 768, 1024]
    Buckets improve generalization across crops/aspect ratios and are a common baseline for Turbo LoRAs. (RunComfy)

  • caption_ext: "txt" + per-image captions
    For realistic characters, captions matter more than people expect:

    • include the trigger token
    • include a class word (person, man, woman) and simple shot descriptors (portrait, full body)
    • avoid heavy style words if you want “neutral photoreal.”
  • caption_dropout_rate: 0.05
    Small dropout reduces prompt overfitting (the LoRA learns the identity rather than memorizing caption phrases).

  • cache_latents_to_disk: true
    Often speeds training once latents are cached. (RunComfy)

  • cache_text_embeddings: false
    There have been Z-Image Turbo related embedding/latents batch mismatch issues reported; leaving this off is the “least surprise” baseline. (GitHub)
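Before launching, it can save a wasted run to sanity-check the folder: every image should have a matching .txt caption that contains the trigger token. A minimal stdlib sketch (check_dataset is a hypothetical helper, not part of AI Toolkit):

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def check_dataset(folder: str, trigger: str) -> list:
    """Return human-readable problems; an empty list means the folder looks sane."""
    problems = []
    images = sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_EXTS)
    if not images:
        problems.append("no images found")
    for img in images:
        cap = img.with_suffix(".txt")
        if not cap.exists():
            problems.append(f"{img.name}: missing caption ({cap.name})")
        elif trigger not in cap.read_text():
            problems.append(f"{img.name}: caption does not contain trigger '{trigger}'")
    return problems
```

Run it against your dataset folder with your trigger word and fix anything it reports before starting the 3000-step run.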

7) train

  • steps: 3000 is a common first run target for 10–30 images. (RunComfy)
  • batch_size: 1 + gradient_accumulation: 2 gives you more stable updates without triggering batch-related weirdness.
  • noise_scheduler: flowmatch matches known working Turbo configs. (GitHub)
  • dtype: bf16 is generally stable on A100.
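To see what those numbers mean for your dataset: each optimizer step consumes batch_size × gradient_accumulation images, so 3000 steps at an effective batch of 2 is a lot of passes over a small dataset. A quick sanity check (dataset_epochs is my helper name, not an AI Toolkit setting):

```python
def dataset_epochs(steps: int, batch_size: int, grad_accum: int, num_images: int) -> float:
    # Images consumed per optimizer step = batch_size * grad_accum (one effective batch)
    return steps * batch_size * grad_accum / num_images

# With the YAML above (steps=3000, batch 1, accum 2) and a 20-image dataset:
print(dataset_epochs(3000, 1, 2, 20))  # 300.0 passes over the data
```

Hundreds of passes is normal for character LoRAs, but it is also why overfitting shows up well before 3000 steps on very small datasets.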

8) model.assistant_lora_path (the critical Turbo bit)

This is the training adapter. The adapter’s model card explains:

  • why it’s needed for step-distilled training,
  • why it’s best for shorter runs,
  • and that you remove it at inference to keep Turbo speed. (Hugging Face)

Dataset recipe for a realistic character (what “good results” usually require)

For a photoreal character LoRA, the usual failure modes are “face drift,” “same pose every time,” or “background leaks into identity.”

A practical dataset layout (10–30 images) consistent with common Turbo LoRA advice: (RunComfy)

  • 40% close/portrait (face fidelity, skin texture, hairline)
  • 40% half/full body (body proportions, clothing fit)
  • 20% action/candid (generalization)
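To turn the 40/40/20 guideline into concrete counts for your dataset size, a trivial sketch (split_counts is a made-up helper; rounding remainder goes to the portrait bucket):

```python
def split_counts(n_images: int, ratios=(0.4, 0.4, 0.2)):
    """Concrete image counts for [portrait, half/full body, action/candid]."""
    counts = [round(n_images * r) for r in ratios]
    counts[0] += n_images - sum(counts)  # absorb rounding drift into portraits
    return counts

print(split_counts(25))  # [10, 10, 5]
```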

Backgrounds:

  • Use varied backgrounds (indoor/outdoor) so the LoRA doesn’t glue the character to one scene.
  • Avoid repeating the same studio backdrop in most images.

Captions (simple, consistent):

  • Good: zch4r_001 person, photorealistic portrait, soft window light
  • Good: zch4r_001 person, full body photo, outdoors, casual outfit
  • Avoid: long style chains (they turn your “character LoRA” into a style LoRA)

Common issues & fixes (seen in the wild)

1) “DB timeout / Prisma P1008” in cloud storage

Often appears when the SQLite DB sits on slow/shared storage. Use a local per-instance DB path. (GitHub)
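If you want to verify your storage before blaming AI Toolkit, a quick stdlib probe that times committed SQLite writes will expose slow shared filesystems (probe_sqlite is a hypothetical helper; what counts as "too slow" is a judgment call, but milliseconds-per-commit on local NVMe vs. tens or hundreds on network storage is typical):

```python
import sqlite3
import tempfile
import time
from pathlib import Path

def probe_sqlite(db_path: str, writes: int = 50) -> float:
    """Average seconds per small committed write at db_path."""
    conn = sqlite3.connect(db_path, timeout=5)
    conn.execute("CREATE TABLE IF NOT EXISTS probe (id INTEGER PRIMARY KEY, ts REAL)")
    t0 = time.perf_counter()
    for _ in range(writes):
        conn.execute("INSERT INTO probe (ts) VALUES (?)", (time.perf_counter(),))
        conn.commit()  # each commit forces a flush, which is what stalls on slow storage
    conn.close()
    return (time.perf_counter() - t0) / writes

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        avg = probe_sqlite(str(Path(d) / "probe.db"))
        print(f"avg committed write: {avg * 1000:.2f} ms")
```

Point db_path at the same filesystem as your sqlite_db_path; if local disk is an order of magnitude faster than your mounted volume, move the DB.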

2) “Batch size of latents must be the same or half the batch size of text embeddings”

A reported Z-Image Turbo training issue; safest mitigations:

  • keep batch_size: 1
  • keep cache_text_embeddings: false (GitHub)
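For clarity, these are the two keys to double-check in the YAML above if you hit this error (same values as the baseline config):

```yaml
datasets:
  - cache_text_embeddings: false   # safer for Z-Image Turbo training right now
train:
  batch_size: 1                    # keeps latents and text-embeds batch sizes aligned
```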

3) Samples “don’t change” even after 2000 steps

Common causes:

  • trigger_word not present in sample prompts/captions
  • LoRA scale is effectively 0 in your inference workflow
  • training without the adapter (Turbo drift symptoms: needs more steps/CFG to show changes) (GitHub)

4) Loss becomes NaN

Usually: too aggressive settings (dtype/LR/quantization). A known config example uses fp16 and LR 1e-4; on A100, prefer bf16 and disable quantization for baseline stability. (GitHub)


Good references (configs, guides, and issues)

High-signal guides (Turbo + adapter)

  • Hugging Face Engineering Notes: “Training a LoRA for Z-Image Turbo with the Ostris AI Toolkit” (Hugging Face)
  • RunComfy deep guide (Turbo+adapter vs De-Turbo, baseline params, guidance 0, buckets) (RunComfy)
  • Training adapter model card (why it exists, how it’s made, remove at inference, long-run caveat) (Hugging Face)

AI Toolkit config mechanics (how configs are expected to be created)

  • AI Toolkit README: copy an example config from config/examples/... and edit paths (GitHub)

Issues worth knowing about

  • SQLite timeout on cloud storage (GitHub)
  • Latents/text-embeds batch mismatch (GitHub)
  • “Samples not changing” report (GitHub)

Practical tuning path (do this in order)

  1. Run the baseline YAML exactly as above for 3000 steps.

  2. If identity is weak:

    • increase linear to 32 (and linear_alpha to 32), keeping LR/steps the same.

  3. If you get overfitting (outputs look like the training photos):

    • reduce steps to 2000–2500, or
    • increase caption_dropout_rate to 0.1, or
    • add more varied backgrounds/poses.

  4. After you have one “good” run, A/B test adapter v1 vs v2 with everything else identical. (RunComfy)