How to accumulate loss when the number of examples per batch is not fixed

I am having trouble figuring out how to properly accumulate loss/gradients when each per-device batch contains a different number of "examples."

My examples are actually atomic systems, and a batch of examples will contain atoms from several different systems. Unlike tokens in NLP, which are padded to the same length, my model operates on the atoms in long form (all atoms concatenated) and keeps track of which atoms belong to which system with an indicator.

Given this, I created a model wrapper that returns the loss as a sum over atoms, so batches with more atoms have larger losses on average. This is intentional: since this is an atom-level task, I want a system with more atoms to count for more during training.
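
In other words, the wrapper does roughly this (the MSE loss and the batch/target names below are just placeholders for my actual setup):

import torch.nn.functional as F

def compute_loss(model, per_device_batch):
    # Model outputs one prediction per atom for the whole flattened batch
    per_atom_pred = model(per_device_batch)
    per_atom_target = per_device_batch["targets"]
    # reduction="sum": a system with more atoms contributes a larger loss
    return F.mse_loss(per_atom_pred, per_atom_target, reduction="sum")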

In my training loop I return that summed loss, e.g.:

loss = compute_loss(model, per_device_batch)  # <--- this is not normalized
atoms_in_batch = get_atoms_in_batch(per_device_batch)

The model and the data loader are wrapped with Accelerate. I imagine I can gather atoms_in_batch across devices to normalize the loss accumulated over devices and batches, but I am not sure how to do that, since we are expected to call accelerator.backward inside the loop.
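
Concretely, what I have in mind is something like the sketch below (accumulation_steps, optimizer, dataloader, etc. are from my usual setup; whether the scaling and the placement of accelerator.backward are actually correct is exactly what I'm unsure about):

import torch

for step, per_device_batch in enumerate(dataloader):
    # Un-normalized sum-over-atoms loss from my wrapper
    loss = compute_loss(model, per_device_batch)

    # Number of atoms this device saw, as a 1-element tensor so it can be gathered
    atoms_in_batch = torch.tensor(
        [get_atoms_in_batch(per_device_batch)], device=accelerator.device
    )

    # Gather the per-device atom counts so every process knows the global
    # number of atoms in this distributed batch
    total_atoms = accelerator.gather(atoms_in_batch).sum()

    # Normalize by the global atom count before backward -- this is the part
    # I'm unsure about (e.g. whether it also needs a factor of
    # accelerator.num_processes because gradients are averaged across devices)
    accelerator.backward(loss / total_atoms)

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

In particular, I don't know whether normalizing inside the loop like this interacts correctly with gradient accumulation and with the gradient averaging that DDP does across processes.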

Thanks for any assistance.
