HF Trainer downstream evaluation on multiple GPUs

Hi all, I’m trying to train a language model using HF Trainer on four GPUs (multi-GPU newbie here). During evaluation, I want to track performance on downstream tasks, e.g. image captioning on COCO. I have overridden the evaluate() method and created the evaluation dataset inside it. From the logs I can see that evaluation now runs on all four GPUs in parallel during training, but every GPU processes the same data.
My question is, is there a simple way to split the dataset into four subsets, let each GPU run prediction on a subset, and then merge the results in the end? My initial idea was to split the dataset based on the rank of the current process, but how can I send the results e.g. to the master process (rank 0)? This process would then compute the metrics for all predictions.
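For context, this is the kind of collection I had in mind, sketched with torch.distributed.gather_object (the helper names here are made up, not from any library):

```python
from typing import List, Optional

def shard_indices(dataset_len: int, rank: int, world_size: int) -> range:
    # contiguous, even shards; the last dataset_len % world_size examples are dropped
    n = dataset_len // world_size
    return range(rank * n, (rank + 1) * n)

def gather_to_rank0(local_results: list, rank: int, world_size: int) -> Optional[List[list]]:
    import torch.distributed as dist  # requires an initialized process group
    # only the destination rank allocates the output list
    output: Optional[List[list]] = [None] * world_size if rank == 0 else None
    dist.gather_object(local_results, output, dst=0)
    return output  # per-rank result lists on rank 0, None on all other ranks
```

Rank 0 would then flatten the gathered lists and compute the metrics once.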

Thank you a lot! :hugs:

I ended up using a small helper function like this:

from typing import Callable, Dict

import torch.distributed as dist
from torch.utils.data import Dataset, Subset

# defined as a method on my Trainer subclass, hence `self`
def evaluate_distributed(
    self,
    dataset: Dataset,
    eval_func: Callable[[Dataset], Dict[str, float]],
) -> Dict[str, float]:

    # split into even subsets based on the local rank of each GPU
    n = len(dataset) // self.args.world_size
    subset = Subset(
        dataset,
        range(self.args.local_rank * n, (self.args.local_rank + 1) * n),
    )
    metrics = eval_func(subset)

    # gather every process's metrics dict into a list on all processes
    gathered_metrics = [None for _ in range(self.args.world_size)]
    dist.all_gather_object(gathered_metrics, metrics)

    # average float metrics; keep anything else as a per-rank list
    all_metrics = {}
    for k, v in metrics.items():
        l = [d[k] for d in gathered_metrics]
        if isinstance(v, float):
            all_metrics[k] = sum(l) / len(l)
        else:
            all_metrics[k] = l

    return all_metrics

This method takes a dataset and a function that computes a metric dictionary from a subset of that dataset. Each process evaluates its own shard, and dist.all_gather_object() then collects the per-process metric dicts into a list on every process, so each rank ends up with all results. Of course, this is only valid when the final metric can be computed as the mean of the metric over the individual splits (and note that the last len(dataset) % world_size examples are dropped by the even split).
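To illustrate just the merge step in isolation (plain Python, no process group needed; note that non-float values are kept as per-rank lists rather than averaged):

```python
def merge_metrics(gathered):
    # gathered: one metrics dict per rank, all sharing the same keys
    merged = {}
    for k, v in gathered[0].items():
        values = [d[k] for d in gathered]
        # average float metrics; keep anything else as a per-rank list
        merged[k] = sum(values) / len(values) if isinstance(v, float) else values
    return merged

print(merge_metrics([{"bleu": 0.25, "n": 100}, {"bleu": 0.75, "n": 100}]))
# {'bleu': 0.5, 'n': [100, 100]}
```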