compute_metrics() behaves strangely in a distributed setting

Hello!

I am using the HF Seq2SeqTrainer to fine-tune CodeT5 on a task. I launch training with torchrun on a single node with 2 GPUs.

I recently noticed that the compute_metrics() function I wrote receives an EvalPrediction instance with more sequences than my validation set contains. The validation set I pass to the Trainer has 250 examples, but if you add

print(preds.predictions.shape)

into your compute_metrics() function, it prints (264, *), where * varies with the longest sequence the model generates.

Upon further inspection, the excess entries [250, 264) appear to just be repeats of entries [0, 14). The issue disappears completely if I run without parallelisation.
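For what it's worth, the numbers would line up if the eval sampler rounds the dataset up so every rank gets the same number of full batches, wrapping the extra slots around to the start. A minimal sketch of that round-up arithmetic, assuming a hypothetical per-device eval batch size of 12 (I haven't confirmed this is what the Trainer's sampler does in my version):

```python
import math

def padded_eval_size(num_samples, per_device_batch, world_size):
    # Round the sample count up to a multiple of (batch * world_size)
    # so each rank receives the same number of complete batches.
    chunk = per_device_batch * world_size
    return math.ceil(num_samples / chunk) * chunk

# Hypothetical numbers: 250 val samples, 2 GPUs, per-device eval batch of 12.
total = padded_eval_size(250, 12, 2)
print(total)        # 264
print(total - 250)  # 14 extra slots, which would wrap to indices [0, 14)
```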

Also, compute_metrics() seems to be called on both devices, since the shape information is printed twice (once from rank 0 and once from rank 1).

Can anyone suggest what is happening? :frowning: