Problem with torch.multiprocessing and Roberta

I have a project in which I extract entities from multiple files, line by line. So I wrote a function that receives a file along with RoBERTa and its tokenizer. The idea is to spawn multiple processes and run this function asynchronously for each file (at this point the files are already loaded into memory). I have 16 GB of RAM on my machine and thought this would be enough to run at least 2 or 3 RoBERTa models in parallel, but the following code hangs, fills 100% of my RAM, and does nothing. Does anyone know what I am doing wrong with the multiprocessing code? I've reduced my problem to these few lines, which show the same behaviour.

    import torch
    from transformers import RobertaTokenizer, RobertaModel,RobertaForTokenClassification
    from tqdm import tqdm
    from torch.multiprocessing import Pool
    import torch.multiprocessing as mp

    model = RobertaForTokenClassification.from_pretrained('distilroberta-base')
    tokenizer = RobertaTokenizer.from_pretrained('distilroberta-base')

    model.share_memory() # is this necessary?
    model.eval()

    ctx = mp.get_context('spawn')
    p = ctx.Pool(2)

   
    def f(model,tokenizer,sentence):

        inputs = tokenizer(sentence, return_tensors="pt")

        logits = model(**inputs)
        
        return 0


    sentences = [
        'yo this is a test',
        'yo this is not a test',
        'yo yo yo'
    ]

    jobs = []
    with torch.no_grad():
        for i in range(len(sentences)):
            job = p.apply_async(f, [model,tokenizer,sentences[i]])
            jobs.append(job)

        results = []

        for job in tqdm(jobs):
            results.append(job.get())

This actually doesn’t work even with just one worker:

    p = ctx.Pool(1)

So I think it is related to the multiprocessing code.

This might be related to the fast tokenizer, which is a multiprocessing-enabled Rust tokenizer. I’d suggest having one tokenizer in a separate process and using a queue to request tokenization. You can then run different models in different processes.
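
Something along these lines could work (a rough, untested sketch of that idea; the worker functions, queue layout, and sentinel handling are my own choices, and each model process loads its own copy of the weights instead of receiving the model as an argument):

    import torch
    import torch.multiprocessing as mp
    from transformers import RobertaTokenizer, RobertaForTokenClassification

    def tokenizer_worker(sentence_q, token_q):
        # One process owns the tokenizer and serves all tokenization requests.
        tokenizer = RobertaTokenizer.from_pretrained('distilroberta-base')
        while True:
            item = sentence_q.get()
            if item is None:                   # sentinel: forward it and stop
                token_q.put(None)
                break
            idx, sentence = item
            enc = tokenizer(sentence, return_tensors="pt")
            token_q.put((idx, dict(enc)))      # plain dict of tensors pickles cleanly

    def model_worker(token_q, result_q):
        # Each model process loads its own copy instead of getting it from the
        # parent, which avoids pickling the model but costs RAM per process.
        model = RobertaForTokenClassification.from_pretrained('distilroberta-base')
        model.eval()
        while True:
            item = token_q.get()
            if item is None:
                break
            idx, inputs = item
            with torch.no_grad():
                logits = model(**inputs).logits
            result_q.put((idx, logits.shape))

    if __name__ == '__main__':
        ctx = mp.get_context('spawn')
        sentence_q, token_q, result_q = ctx.Queue(), ctx.Queue(), ctx.Queue()

        tok_proc = ctx.Process(target=tokenizer_worker, args=(sentence_q, token_q))
        mod_proc = ctx.Process(target=model_worker, args=(token_q, result_q))
        tok_proc.start()
        mod_proc.start()

        sentences = ['yo this is a test', 'yo this is not a test', 'yo yo yo']
        for i, s in enumerate(sentences):
            sentence_q.put((i, s))
        sentence_q.put(None)

        results = [result_q.get() for _ in sentences]
        tok_proc.join()
        mod_proc.join()
        print(results)

With more than one model process you would need one sentinel per worker on `token_q`, since the tokenizer process only forwards a single one here.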

However:

  • The first thing to do to speed everything up is to use a single model with large batches rather than line-by-line processing, which is incredibly slow (see the sketch after this list).
  • If you have a GPU available, use it. It’ll be faster.
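
To illustrate both points, a minimal single-process sketch (untested; the batch size is a placeholder to tune to your memory, and the device move covers the GPU case):

    import torch
    from transformers import RobertaTokenizer, RobertaForTokenClassification

    tokenizer = RobertaTokenizer.from_pretrained('distilroberta-base')
    model = RobertaForTokenClassification.from_pretrained('distilroberta-base')
    model.eval()

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.to(device)

    sentences = ['yo this is a test', 'yo this is not a test', 'yo yo yo']
    batch_size = 2   # placeholder: tune to whatever fits in memory

    all_logits = []
    with torch.no_grad():
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            # Pad so every sequence in the batch shares one tensor shape
            inputs = tokenizer(batch, padding=True, return_tensors="pt").to(device)
            all_logits.append(model(**inputs).logits.cpu())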