Hey everyone
I’m working on a LoRA training experiment using ~3,000 realistic e-commerce product photos.
The goal is to build a single LoRA that captures clean, realistic, professional product shots with consistent lighting and color.
The dataset always keeps the same composition, with the product at the center and an environment that varies from plain solid colors to more complex settings such as nature, lifestyle, or more artistic still-life compositions.
Thanks for this very detailed answer. I’m planning to train on a 48 GB VRAM GPU using Runpod. I’m not sure I fully understand what you recommend regarding the batch size.
Should I keep it at 1? I’ve read that a small batch can hurt training if the dataset is too big, and that it’s recommended to crank it up to 32/64. Is that correct?
Regarding dataset size and training: LoRA fine-tuning rewrites only a small portion of the neural network. Since current networks are still relatively small (compared to a human brain), that adapted portion reaches its capacity quickly. Forcing it to learn beyond that limit makes its behavior drift from what you expect: it becomes overly reliant on the training data. That’s overfitting, most likely.
Therefore, if the dataset is too large, you may need to use only a small portion (preferably the higher-quality part) of the dataset, or lower the learning rate to encourage broader, shallower learning. Additionally, there may be an option to automatically stop training when signs of overfitting appear (Early Stopping).
However, the best approach might be to train incrementally, obtain checkpoints at several stages, and then select the best checkpoint for your needs. If it’s underfitted, you can just train more, but overfitting can’t be undone…
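If you go the early-stopping / checkpoint-selection route, here is a toy sketch of the idea in Python. The numbers are made up; in practice you would use your own validation loss or per-checkpoint quality scores.

```python
# Toy early-stopping / checkpoint-selection sketch (values are invented, not real logs):
# stop once the validation metric hasn't improved for `patience` checkpoints,
# and keep the checkpoint where it was best.
val_losses = [0.42, 0.35, 0.31, 0.30, 0.32, 0.36]  # one entry per saved checkpoint
patience = 2

best_idx, best_loss, since_best = 0, float("inf"), 0
for i, loss in enumerate(val_losses):
    if loss < best_loss:
        best_idx, best_loss, since_best = i, loss, 0
    else:
        since_best += 1
        if since_best >= patience:
            break  # signs of overfitting: stop and fall back to the best checkpoint

print(f"Use checkpoint {best_idx} (val loss {best_loss})")
```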
Keep the micro-batch small (1–2) and reach an effective batch of 8–16 with gradient accumulation. Do not chase 32–64 by default. If you ever go that high for throughput, scale the LR linearly and extend warmup. Your quality is driven more by total steps and curation than raw batch. (Hugging Face)
Background and terms
Micro-batch = train_batch_size that fits on the GPU per step.
Effective batch = micro_batch × gradient_accumulation_steps × num_gpus. Accumulation lets you simulate a larger batch without holding it in memory at once. Accelerate provides this natively. (Hugging Face)
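A minimal, self-contained sketch of that pattern with Accelerate (toy model and data just to show the mechanics; your actual LoRA training script wires this up for you):

```python
# Micro-batch 1 + gradient accumulation 8 -> effective batch 8, using Hugging Face Accelerate.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=8)

model = torch.nn.Linear(16, 1)                      # stand-in for the LoRA-adapted model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
loader = DataLoader(data, batch_size=1)             # micro-batch that fits on the GPU

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    # Inside this context, Accelerate scales the loss and only steps the optimizer
    # every 8 micro-batches, simulating the larger batch without holding it in memory.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```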
Why small micro-batch + moderate effective batch is right for FLUX
The official FLUX LoRA recipe fine-tunes on a 4090 with batch=1 and reports strong results. It emphasizes VRAM reality and steps over large batches. At 512×768, QLoRA peaked near ~9 GB, fp16 LoRA near ~26 GB; 1024×1024 is heavier, so accumulation is the intended path. Your 48 GB card can usually do micro-batch 1–2 at 1024×1024 and still leave room for context length and checkpoints. (Hugging Face)
Large batches can generalize worse unless you retune LR and warmup; the classic results show the “generalization gap” for very big batches. Keep batches moderate unless you have a specific throughput goal. (arXiv)
“I read I should push 32/64 if the dataset is big.” Context
Big dataset ≠ must use big batch. Batch size primarily trades off noise vs speed. For style LoRA, you want enough gradient noise to avoid over-smooth solutions and you want more optimizer steps at a stable LR. The FLUX write-up targets hundreds to low-thousands of steps and shows clean convergence without large batches. (Hugging Face)
If you do increase effective batch for throughput or multi-GPU scaling, apply the linear LR scaling rule: LR_new = LR_base × (B_new / B_base) and add extra warmup. This is the established recipe for large-batch stability. (arXiv)
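A quick sanity check of that rule in Python, using the numbers from this thread:

```python
# Linear LR scaling rule: LR_new = LR_base * (B_new / B_base).
def scale_lr(lr_base: float, batch_base: int, batch_new: int) -> float:
    return lr_base * (batch_new / batch_base)

print(scale_lr(1e-4, 8, 32))  # 4e-04: going from effective batch 8 to 32 multiplies LR by 4
```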
Practical settings for your 48 GB GPU (Runpod)
Default: micro-batch 1–2, effective 8–16 via accumulation. Start LR=1e-4 constant, warmup 0–200. Keep guidance_scale=1 during training as in the FLUX recipe. (Hugging Face)
Example single-GPU presets:
Stable: batch=1, grad_accum=8 → eff 8
Faster: batch=2, grad_accum=4 → eff 8
Heavier: batch=2, grad_accum=8 → eff 16
Choose by VRAM headroom and throughput. (Hugging Face)
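The same presets in plain Python, if you want to sanity-check the effective batch; the key names mirror the usual train_batch_size / gradient_accumulation_steps arguments, but double-check them against whatever training script you actually run:

```python
# Effective batch = micro-batch x gradient_accumulation_steps (single GPU).
presets = {
    "stable":  {"train_batch_size": 1, "gradient_accumulation_steps": 8},
    "faster":  {"train_batch_size": 2, "gradient_accumulation_steps": 4},
    "heavier": {"train_batch_size": 2, "gradient_accumulation_steps": 8},
}

for name, p in presets.items():
    eff = p["train_batch_size"] * p["gradient_accumulation_steps"]
    print(f"{name}: effective batch {eff}")  # 8, 8, 16
```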
Step math for your 3,000 images (optimize for steps, not epochs)
Steps per epoch = images / effective_batch.
eff 8 → 375 steps/epoch.
eff 16 → 187.5 steps/epoch.
Target ~1,200–2,000 steps first. Save checkpoints every 100–200 steps and pick the best by color fidelity and lighting consistency. This mirrors the FLUX LoRA guidance. (Hugging Face)
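The same math in a few lines, so you can plug in your own numbers:

```python
# Steps-per-epoch and checkpoint math for ~3,000 images (adjust to your dataset).
images = 3000
effective_batch = 8        # e.g. micro-batch 1 x grad_accum 8
target_steps = 1200        # first training run target
checkpoint_every = 200

steps_per_epoch = images / effective_batch
epochs_needed = target_steps / steps_per_epoch
checkpoints = target_steps // checkpoint_every

print(f"{steps_per_epoch:.0f} steps/epoch")                      # 375
print(f"~{epochs_needed:.1f} epochs for {target_steps} steps")   # ~3.2
print(f"{checkpoints} checkpoints to compare")                   # 6
```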
Gradient accumulation: two caveats
Accumulation is not numerically identical to one giant batch, but it’s the standard workaround and is close when loss scaling is handled correctly. Accelerate divides the loss for you; if you hand-roll loops, ensure you divide the loss by accum_steps (see the sketch after these two caveats). (Hugging Face)
Some reports show minor differences between “same effective batch with/without accumulation,” which is expected and usually harmless. Don’t chase exact loss parity; judge by validation images. (PyTorch Forums)
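For the hand-rolled case, a toy example of what correct accumulation looks like; the details that matter are dividing the loss by accum_steps and stepping the optimizer only every accum_steps micro-batches:

```python
# Manual gradient accumulation in plain PyTorch (toy model and data).
import torch
from torch.utils.data import DataLoader, TensorDataset

accum_steps = 8
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=1)

optimizer.zero_grad()
for step, (x, y) in enumerate(loader, start=1):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()   # divide so gradients average over the accumulation window
    if step % accum_steps == 0:       # step only once per effective batch
        optimizer.step()
        optimizer.zero_grad()
```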
When to consider 32–64 effective batch
Only for throughput or multi-GPU scaling needs. If you jump from eff 8 → 32, multiply LR by 4 and lengthen warmup; then re-tune because very large batches can reduce gradient noise and hurt detail unless compensated. (arXiv)
One-line recipe to use today
Keep micro-batch small. Aim eff 8–16 with accumulation.
LR 1e-4 constant, warmup 0–200.
Train to steps, not epochs. Start ~1.2k steps.
Scale LR if you scale batch. Validate with fixed prompts. (Hugging Face)