Is it possible that Accelerate may not divide the data evenly among processes?

seanswyi · July 4, 2023, 1:59am

I’m currently using HuggingFace Accelerate to train a model and using sklearn.metrics.classification_report to get the results. I noticed that the support values for certain classes differ depending on whether I’m using one process vs. multiple processes.

I asked ChatGPT (lol) whether this could happen, and it said that an uneven distribution of the data across processes can cause this. I’m not sure how true that is: my initial intuition was that even with an uneven distribution, the data should still be divided (even if not perfectly equally) and later gathered, so the support shouldn’t differ.

Please let me know if I’m thinking incorrectly. Thanks.

muellerzr · July 4, 2023, 3:43pm

Are you gathering? Please see the example script on metrics: https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py#L180-L184

seanswyi · July 5, 2023, 12:39am

I’m actually not gathering the way your suggested example does. I refactored an existing project only slightly so that I could use HF Accelerate without changing too much of the code.

The code currently gathers using torch.distributed.all_gather_object. More specifically, I have an intermediate value that accumulates predictions and labels inside the evaluation loop; after inference, I gather everything into a final array-like object so that I can run the evaluation with scikit-learn:

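import torch.distributed as dist

# Per-process accumulator for this rank's predictions and targets.
intermediate_value = {"preds": [], "targets": []}
# One slot per process; all_gather_object fills these in place.
output = [None] * accelerator.num_processes
for step, batch in enumerate(valid_dataloader):
    y_pred = model(batch)
    intermediate_value["preds"].append(y_pred)
    intermediate_value["targets"].append(batch["target"])

# Gather every rank's dict into `output` (one entry per process).
dist.all_gather_object(output, intermediate_value)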

Is this approach not recommended? I would assume it should work just fine, but I’m wondering whether it behaves differently from HF’s approach.

muellerzr · July 5, 2023, 2:17am

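You’re going to have extra items in here: when the dataset doesn’t divide evenly across processes, some samples are repeated so that every process gets the same number of batches. So, as always, please use the API (accelerator.gather_for_metrics); otherwise you’ll need to drop the repeats yourself.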
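To see why the support values change, here is a pure-Python sketch of the effect. It is a simplified model of padded sharding, not Accelerate's actual sampler code; the function shard_with_padding and the toy labels are hypothetical. Indices wrap around to the start of the dataset so that every process receives the same number of full batches, and a naive gather then counts the wrapped samples twice.

```python
from collections import Counter


def shard_with_padding(num_samples, num_processes, batch_size):
    """Simplified model: wrap indices so each process gets equal full batches."""
    total_batch = num_processes * batch_size
    padded_len = -(-num_samples // total_batch) * total_batch  # ceiling multiple
    indices = [i % num_samples for i in range(padded_len)]
    batches = [indices[i:i + batch_size] for i in range(0, padded_len, batch_size)]
    # Process `rank` takes every num_processes-th batch, starting at `rank`.
    return [
        [idx for batch in batches[rank::num_processes] for idx in batch]
        for rank in range(num_processes)
    ]


# Hypothetical toy labels: 10 samples over 3 classes.
labels = [i % 3 for i in range(10)]
shards = shard_with_padding(num_samples=10, num_processes=2, batch_size=4)

# Naive gather: concatenate everything each process saw.
gathered = [idx for shard in shards for idx in shard]
naive_support = Counter(labels[i] for i in gathered)
true_support = Counter(labels)

# Dropping the repeats by sample index recovers the original dataset.
deduped = sorted(set(gathered))

print(len(gathered), naive_support, true_support, len(deduped))
```

In this toy setup the naive gather yields 16 items for a 10-sample dataset, so the per-class support is inflated. Removing repeats by hand requires tracking sample indices, which is the bookkeeping the API takes care of.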
