Problem with torch.multiprocessing and Roberta

I have a project in which I extract entities from multiple files, line by line, so I wrote a function that receives a file along with RoBERTa and its tokenizer. The idea is to spawn multiple processes and run this function asynchronously for each file (at this point the files are already loaded into memory). I have 16 GB of RAM on my machine and thought that would be enough to run at least 2 or 3 RoBERTa instances in parallel, but my code hangs, fills 100% of my RAM, and does nothing. Does anyone know what I am doing wrong with the multiprocessed code?
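For context, the per-file function looks roughly like the sketch below (the name extract_entities_from_file and the entity-decoding step are placeholders; the real version is longer):

    def extract_entities_from_file(model, tokenizer, lines):
        # `lines` is one file's contents, already loaded into memory
        entities = []
        with torch.no_grad():
            for line in lines:
                inputs = tokenizer(line, return_tensors="pt")
                outputs = model(**inputs)  # TokenClassifierOutput in transformers v4+
                # outputs.logits has shape (1, seq_len, num_labels); mapping the
                # argmax label ids back to entity spans is omitted here
                entities.append(outputs.logits.argmax(dim=-1))
        return entities

I've simplified my problem to the few lines below, which show the same behavior: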

    import torch
    import torch.multiprocessing as mp
    from transformers import RobertaTokenizer, RobertaForTokenClassification
    from tqdm import tqdm

    model = RobertaForTokenClassification.from_pretrained('distilroberta-base')
    tokenizer = RobertaTokenizer.from_pretrained('distilroberta-base')

    model.share_memory() # is this necessary?
    model.eval()

    ctx = mp.get_context('spawn')
    p = ctx.Pool(2)

   
    def f(model, tokenizer, sentence):
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)  # forward pass only; the logits live in outputs.logits
        return 0


    sentences = [
        'yo this is a test',
        'yo this is not a test',
        'yo yo yo'
    ]

    jobs = []
    with torch.no_grad():
        for sentence in sentences:
            job = p.apply_async(f, [model, tokenizer, sentence])
            jobs.append(job)

        results = []
        for job in tqdm(jobs):
            results.append(job.get())
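
For reference, this is the equivalent single-process call I'd compare against:

    # baseline: the same forward pass, no pool involved
    with torch.no_grad():
        f(model, tokenizer, sentences[0])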