beam_search bottlenecks inference: only 1 CPU used

Hello there, :slight_smile:

I am having trouble optimizing my translation inference pipeline. The bottleneck seems to be beam_search, which uses only 1 CPU even though 1 GPU and 16 CPUs are available.

Here is an overview of CPU usage, captured with py-spy:

I have looked at the code to try to understand what is happening, without success:
(https://github.com/huggingface/transformers/blob/bd469c40659ce76c81f69c7726759d249b4aef49/src/transformers/generation_beam_search.py#L208)

Here is the py-spy stack trace pointing at the problematic lines:

Thread 188015 (active): "MainThread"
    process (transformers/generation_beam_search.py:273)
    beam_search (transformers/generation_utils.py:2285)
    generate (transformers/generation_utils.py:1385)
    decorate_context (torch/autograd/grad_mode.py:27)
    infer_dataset (multi_translator.py:43)
    _CallAndUpdateTrace (fire/core.py:681)
    _Fire (fire/core.py:466)
    Fire (fire/core.py:141)
    main (multi_translator.py:58)
    <module> (multi_translator.py:62)
    _run_code (runpy.py:87)
    _run_module_as_main (runpy.py:197)

What seems strange is that I thought beam_search ran on the GPU to be fast (we can see the `device=device` in the code). I don’t know why the CPU is used here, nor how to make it use either the GPU or all the available CPUs.
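For what it's worth, the hot frame in the trace (`BeamSearchScorer.process`) is plain Python: it iterates over batch items and candidate beams per decoding step, so that part runs on one CPU core no matter where the model lives. Here is a tiny self-contained sketch of the difference between a Python selection loop and a vectorized one (toy numbers, not transformers internals):

```python
import torch

# Toy stand-in for one beam-search step: pick the best `num_beams`
# candidates per batch item from the scored continuations.
# (Illustrative only -- this is NOT the transformers implementation.)
batch_size, num_beams, vocab = 4, 5, 1000
scores = torch.randn(batch_size, num_beams * vocab)

# Python-loop version: single-threaded, like the scorer's per-item loop.
loop_tokens = []
for b in range(batch_size):
    top = sorted(range(scores.shape[1]),
                 key=lambda i: scores[b, i].item(),
                 reverse=True)[:num_beams]
    loop_tokens.append(top)

# Vectorized version: one torch call, runs on GPU if `scores` lives there.
_, vec_tokens = torch.topk(scores, num_beams, dim=1)

# Both select the same candidates; only the loop is stuck on one core.
assert loop_tokens == vec_tokens.tolist()
```

This is why py-spy shows a single busy core: the GPU does the forward passes, but the beam bookkeeping between steps is serial Python.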

Here is a small reproducible example:

import pandas as pd
import torch
import tqdm
from torch.utils.data import DataLoader, Dataset
from transformers import MarianMTModel, MarianTokenizer

class TranslationDataset(Dataset):
    def __init__(self, dataset_path, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        self.dataset = pd.read_csv(dataset_path)["text"].values

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int):
        x = self.tokenizer(self.dataset[idx], return_tensors="pt",
                           max_length=512, truncation=True,
                           padding="max_length")
        return x["input_ids"][0], x["attention_mask"][0]

device = 0  # GPU index

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-mul-en")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-mul-en").to(device)
dataset = TranslationDataset("data.csv", tokenizer)

for input_ids, attention_mask in tqdm.tqdm(DataLoader(dataset,
                                                      batch_size=64,
                                                      num_workers=16,
                                                      pin_memory=True)):
    tokenized_outputs = model.generate(input_ids=input_ids.to(device),
                                       attention_mask=attention_mask.to(device),
                                       max_length=512)
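One cheap win, independent of the beam-search question: `padding='max_length'` pads every example to 512 tokens, so the encoder always processes the worst case even for short sentences. Padding each batch only to its longest real sequence is usually much faster. A sketch of such a collate function, using only PyTorch (`collate_batch` is a hypothetical helper; in the real pipeline you would pass `tokenizer.pad_token_id` and tokenize in `__getitem__` with `truncation=True` and no padding):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch, pad_token_id=0):
    """Pad a list of (input_ids, attention_mask) pairs to the longest
    sequence in the batch instead of a fixed max_length."""
    input_ids, attention_masks = zip(*batch)
    input_ids = pad_sequence(input_ids, batch_first=True,
                             padding_value=pad_token_id)
    attention_masks = pad_sequence(attention_masks, batch_first=True,
                                   padding_value=0)
    return input_ids, attention_masks

# Example with variable-length samples:
batch = [(torch.tensor([5, 6, 7]), torch.ones(3, dtype=torch.long)),
         (torch.tensor([8, 9]), torch.ones(2, dtype=torch.long))]
ids, mask = collate_batch(batch)
# ids is padded to length 3 (the longest item), not to 512.
```

You would plug this in via `DataLoader(..., collate_fn=collate_batch)`. Shorter encoder inputs also mean less mask bookkeeping per generate call.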

Could you help me figure out why beam_search is the bottleneck of the pipeline, and how to make it use multiple CPUs or the GPU, please? :slight_smile:

Relevant info:

  • num_beams > 1
  • num_beam_groups = 1
  • do_sample = False
  • is_constraint_gen_mode = False

Thanks in advance :smiley:

It seems like I am not the only one facing this problem:

Any ideas for a solution? :slight_smile: