CPU Multiprocessing for Text Generation

Hello. I’m trying to use multiprocessing to generate summaries for the text in a DataFrame. The pool.map() call just hangs when I run it on a custom generate function. I tried debugging by removing lines from the generate function (shown below), and it works fine once I remove the model.generate() call.

Is this not the right way to run multiprocessing on the generate function? Or is multiprocessing on CPUs simply not possible with model.generate()?

import torch
import torch.multiprocessing as mp

# df, tokenizer, model, and device are defined earlier in the script
def generate(i):
  content = df.loc[i, 'content_web']
  location = df.loc[i, 'location']
  content = content + '. ' + location

  tokenized_text = tokenizer(content, truncation=True, padding=True, return_tensors='pt')
  source_ids = tokenized_text['input_ids'].to(device, dtype=torch.long)
  source_mask = tokenized_text['attention_mask'].to(device, dtype=torch.long)

  generated_ids = model.generate(
      input_ids=source_ids,
      attention_mask=source_mask,
      max_length=512,
      min_length=50,
      num_beams=4,
      repetition_penalty=2.5,
      length_penalty=2.0,
      early_stopping=True,
      no_repeat_ngram_size=8,
  )
  return tokenizer.decode(generated_ids[0], skip_special_tokens=True)

pool = mp.Pool(processes=4)
results = pool.map(generate, range(0, 2))
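For what it's worth, a common cause of this kind of hang is the combination of the default fork start method and PyTorch's internal thread pools: a forked worker can deadlock inside model.generate(). Typical workarounds are calling torch.set_num_threads(1) before creating the pool, or using the spawn start method and loading the model inside each worker. A minimal sketch of the pool structure, using a hypothetical summarize() stand-in for the tokenize/generate/decode steps (the real model is omitted here):

```python
import multiprocessing as mp

def summarize(text):
    # Hypothetical stand-in for tokenize + model.generate() + decode;
    # here we just take the first sentence of the input.
    return text.split('.')[0]

def run_parallel(texts):
    # 'fork' is shown for brevity; with a real PyTorch model, either
    # call torch.set_num_threads(1) before creating the pool, or use
    # the 'spawn' context and load the model in each worker via the
    # Pool's initializer argument instead of inheriting it.
    ctx = mp.get_context('fork')
    with ctx.Pool(processes=2) as pool:
        return pool.map(summarize, texts)
```

With a real model, summarize() would load (or receive via an initializer) its own copy of the model per worker rather than relying on a copy inherited across fork.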

I have the same issue for prediction with AutoModelForSequenceClassification. pool.map() just hangs, while a Python map works fine.

Hi,
I have the same issue. I am using the MarianMT model. Is there any fix for this problem?