The distributed evaluation section of the docs says to use accelerator.gather
to collect predictions from the various devices.
My question: do you also need accelerator.gather
when collecting the losses from the various cores? I defined the loss calculation as follows:
# Evaluate at the end of the epoch (distributed evaluation as we have 8 TPU cores)
model.eval()
validation_losses = []
for batch in val_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    loss = outputs.loss
    # We gather the loss from the 8 TPU cores to have them all.
    validation_losses.append(accelerator.gather(loss))

# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", sum(validation_losses) / len(validation_losses))
However, this fails:
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 290, in gather
    return gather(tensor)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/utils.py", line 171, in gather
    return _tpu_gather(tensor, name="accelerate.utils.gather")
  File "/usr/local/lib/python3.7/dist-packages/accelerate/utils.py", line 144, in _tpu_gather
    return xm.mesh_reduce(name, tensor, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
Presumably because the loss is a zero-dimensional tensor rather than at least 1-D, so torch.cat cannot concatenate the gathered values.
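For what it's worth, a likely workaround (a sketch, not verified on TPU) is to give the loss a batch dimension before gathering, e.g. accelerator.gather(loss.unsqueeze(0)), since the gather ultimately calls torch.cat, which needs at least 1-D tensors. This can be reproduced with plain torch, standing in for the TPU mesh_reduce:

```python
import torch

# outputs.loss is a zero-dimensional tensor, like this one.
loss = torch.tensor(2.5)

# torch.cat (what mesh_reduce applies under the hood) rejects 0-dim tensors.
try:
    torch.cat([loss, loss])
except RuntimeError as e:
    print(e)  # zero-dimensional tensor (at position 0) cannot be concatenated

# Adding a batch dimension makes concatenation, and hence the gather, work.
loss_1d = loss.unsqueeze(0)               # shape (1,)
gathered = torch.cat([loss_1d, loss_1d])  # stands in for accelerator.gather(loss_1d) across 2 cores
print(gathered.shape)                     # torch.Size([2])
```

So in the loop above, the append line would become validation_losses.append(accelerator.gather(loss.unsqueeze(0))).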