Distributed inference on multiple files

Reproducing the issue from GitHub: Deadlock when loading the model in multiprocessing context · Issue #15976 · huggingface/transformers · GitHub

I am using the following snippet:

import torch
from pathlib import Path
import multiprocessing as mp
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

queue = mp.Queue()


def load_model(filename):
    device = queue.get()
    print('Loading')
    model = AutoModelForSeq2SeqLM.from_pretrained('models/sqgen').to(device)
    print('Loaded')
    queue.put(device)


def parallel():
    num_gpus = torch.cuda.device_count()

    with mp.get_context('spawn').Pool(processes=num_gpus) as pool:
        for gpu_id in range(num_gpus):
            queue.put('cuda:{0}'.format(gpu_id))
        pool = mp.Pool(processes=num_gpus)
        flist = list(Path('data').glob('*.json'))
        pool.map(
            load_model,
            flist,
        )
        pool.close()
        pool.join()


if __name__ == '__main__':
    parallel()

This just hangs when loading the model. It is a minimal example I put together to demonstrate the issue.

What I am actually doing: I have 16 large files (possibly more) and 8 GPUs, so I am trying to assign each file to a GPU and run inference in parallel, 8 processes at a time, to use all GPUs simultaneously.

Why is this issue happening? Why does model loading deadlock?
What's the right way to do what I want to achieve?
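
For reference, one arrangement that avoids both a shared queue and forked workers is to start one spawned process per GPU and give each process its own list of files, so the model is loaded once per GPU rather than once per file. The following is only a sketch under that assumption, not a confirmed fix for the linked issue. The names shard_files, worker and the data_dir parameter are hypothetical, the inference step is left as a placeholder, and 'models/sqgen' and 'data' are taken from the snippet above.

```python
import multiprocessing as mp
from pathlib import Path


def shard_files(files, num_workers):
    """Split files round-robin into num_workers lists (some may be empty)."""
    return [files[i::num_workers] for i in range(num_workers)]


def worker(gpu_id, files):
    # Imported inside the worker so CUDA is first touched in the child process.
    import torch
    from transformers import AutoModelForSeq2SeqLM

    device = torch.device(f'cuda:{gpu_id}')
    model = AutoModelForSeq2SeqLM.from_pretrained('models/sqgen').to(device)
    model.eval()
    for path in files:
        # Placeholder: read `path`, tokenize, and run the model here.
        print(f'[{device}] processing {path}')


def parallel(data_dir='data'):
    import torch

    num_gpus = torch.cuda.device_count()
    files = sorted(Path(data_dir).glob('*.json'))
    # Use 'spawn' explicitly instead of the platform default start method.
    ctx = mp.get_context('spawn')
    procs = [
        ctx.Process(target=worker, args=(gpu_id, shard))
        for gpu_id, shard in enumerate(shard_files(files, num_gpus))
        if shard  # skip GPUs that received no files
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

As with the original snippet, parallel() would be called under an `if __name__ == '__main__':` guard, which the spawn start method requires. With 16 files and 8 GPUs, each process receives two files; the device is passed as an argument, so no queue is needed to hand out GPU ids.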