Hi all,
I’ve been evaluating a few causal LMs (e.g. Qwen/Qwen2.5-3B) on 512 samples from the togethercomputer/RedPajama-Data-1T-Sample pretraining dataset, and I noticed that eval loss consistently decreases as I increase the batch size:
| Batch size | Eval loss |
|---|---|
| 1 | 2.414 |
| 2 | 2.340 |
| 4 | 2.299 |
| 8 | 2.298 |
| 16 | 2.296 |
I saw the same trend across other models as well. This is the code I’m using (imports shown for completeness):

```python
from trl import SFTConfig, SFTTrainer

sft_config = SFTConfig(
    output_dir="./results",
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    dataset_text_field="text",
    max_seq_length=args.max_seq_length,
)
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    eval_dataset=dataset,
    processing_class=tokenizer,
)
eval_result = trainer.evaluate()
print(eval_result)
```
Digging in, I found that `fixed_cross_entropy` (in `loss_utils.py`) takes a token-level sum and then divides by the total non-padding token count (micro-averaging). By contrast, I implemented a per-sample average (macro-averaging):

```python
# Hugging Face: token-sum / total non-padding tokens (micro-average)
loss = F.cross_entropy(..., reduction="sum") / num_items_in_batch

# My version: per-sequence average, then mean across sequences (macro-average)
loss = F.cross_entropy(..., reduction="none")
loss = loss.view(B, -1).sum(dim=1) / token_counts_per_seq
loss = loss.mean()
```
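To make the difference concrete, here is a toy, framework-free sketch (the per-token loss values are made up): when per-batch losses are combined weighted by sequence count, the micro-averaged result depends on how variable-length sequences are grouped into batches, while the macro-averaged result does not:

```python
# Toy sketch (made-up per-token losses, no torch needed) comparing
# micro- vs. macro-averaging across two different batchings of the
# same three variable-length sequences.
seqs = [
    [4.0, 0.0],        # 2 tokens
    [3.0],             # 1 token
    [0.0, 0.0, 3.0],   # 3 tokens
]

def micro_avg(batch):
    # token-sum / total non-padding tokens (what fixed_cross_entropy does)
    return sum(sum(s) for s in batch) / sum(len(s) for s in batch)

def macro_avg(batch):
    # per-sequence mean, then mean over sequences
    return sum(sum(s) / len(s) for s in batch) / len(batch)

def combine(batches, avg_fn):
    # combine per-batch losses weighted by number of sequences,
    # mirroring how per-batch eval losses get aggregated overall
    n = sum(len(b) for b in batches)
    return sum(avg_fn(b) * len(b) for b in batches) / n

batching_a = [seqs[:1], seqs[1:]]  # batches of size 1 and 2
batching_b = [seqs[:2], seqs[2:]]  # batches of size 2 and 1

print(combine(batching_a, micro_avg), combine(batching_b, micro_avg))  # differ
print(combine(batching_a, macro_avg), combine(batching_b, macro_avg))  # both 2.0
```

Here micro-averaging gives ~1.667 under one batching and ~1.889 under the other, while macro-averaging gives exactly 2.0 in both cases.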
With macro-averaging, eval loss is identical across batch sizes and input orderings, which enables a few nice benefits:
- We can pick whatever batch size is fastest for evaluation, which is especially useful when comparing models of different sizes.
- Sorting samples by length before batching reduces padding; in my tests this roughly halved evaluation time.
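As a back-of-the-envelope illustration of the padding savings (the token lengths below are made up), padding each batch to its longest sequence costs far fewer total tokens once samples are length-sorted:

```python
# Hypothetical token lengths for 8 eval samples (illustrative only).
def padded_tokens(lengths, batch_size):
    # Total tokens processed when each batch is padded to its longest sequence.
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

lengths = [512, 8, 300, 16, 480, 32, 256, 64]
unsorted_cost = padded_tokens(lengths, batch_size=4)
sorted_cost = padded_tokens(sorted(lengths), batch_size=4)
print(unsorted_cost, sorted_cost)  # sorted cost is well below the unsorted cost
```

With micro-averaging this reordering would change the reported loss; with macro-averaging it only changes the wall-clock time.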
So I’m wondering:
- Is the Trainer’s default micro-averaging behavior intentional, i.e. tying the loss scale strictly to the total token count?
- Does this have any documented effect on training stability or convergence when you vary batch size?
- Are there recommended best practices for loss normalization in large-batch LLM training (e.g. should I always override this to macro-average)?
I’d love to hear from anyone who’s dug into this or has empirical experience with different loss-averaging schemes in the Trainer. Thanks in advance!