CPU Multiprocessing for Text Generation

Hello. I’m trying to use multiprocessing to generate summaries for the text in a DataFrame. The pool.map() call just hangs when I run it on a custom generate function. I tried debugging by removing lines from the generate function (shown below), and it works fine once I remove the model.generate() call.

Is this not the right way to run multiprocessing on the generate function? Or is multiprocessing on CPUs simply not possible with model.generate()?

import torch
import torch.multiprocessing as mp

# df, tokenizer, model, and device are defined earlier in the script
def generate(i):
  content = df.loc[i, 'content_web']
  location = df.loc[i, 'location']
  content = content + '. ' + location

  tokenized_text = tokenizer(content, truncation=True, padding=True, return_tensors='pt')
  source_ids = tokenized_text['input_ids'].to(device, dtype=torch.long)
  source_mask = tokenized_text['attention_mask'].to(device, dtype=torch.long)

  generated_ids = model.generate(
      input_ids=source_ids,
      attention_mask=source_mask,
      max_length=512,
      min_length=50,
      num_beams=4,
      repetition_penalty=2.5,
      length_penalty=2.0,
      early_stopping=True,
      no_repeat_ngram_size=8,
  )
  return tokenizer.decode(generated_ids[0], skip_special_tokens=True)

pool = mp.Pool(processes=4)
results = pool.map(generate, range(0, 2))
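For what it's worth, a common cause of this kind of hang is the combination of the default fork start method and PyTorch's internal thread pools: a forked worker can deadlock inside model.generate(). Typical workarounds are calling torch.set_num_threads(1) before creating the pool, or using the spawn start method and loading the model inside each worker. A minimal sketch of the pool structure, using a hypothetical summarize() stand-in for the tokenize/generate/decode steps (the real model is omitted here):

```python
import multiprocessing as mp

def summarize(text):
    # Hypothetical stand-in for tokenize + model.generate() + decode;
    # here we just take the first sentence of the input.
    return text.split('.')[0]

def run_parallel(texts):
    # 'fork' is shown for brevity; with a real PyTorch model, either
    # call torch.set_num_threads(1) before creating the pool, or use
    # the 'spawn' context and load the model in each worker via the
    # Pool's initializer argument instead of inheriting it.
    ctx = mp.get_context('fork')
    with ctx.Pool(processes=2) as pool:
        return pool.map(summarize, texts)
```

With a real model, summarize() would load (or receive via an initializer) its own copy of the model per worker rather than relying on a copy inherited across fork.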

I have the same issue for prediction with AutoModelForSequenceClassification. pool.map() just hangs, while a Python map works fine.

Hi,
I have the same issue. I am using the MarianMT model. Is there any fix for this problem?