Hello there,
I am having trouble optimizing my translation inference pipeline. The bottleneck seems to be beam_search, which uses only one CPU core even though one GPU and 16 CPU cores are available.
I have looked at the code to try to understand what is happening, but without success:
https://github.com/huggingface/transformers/blob/bd469c40659ce76c81f69c7726759d249b4aef49/src/transformers/generation_beam_search.py#L208
Here is an overview of CPU usage, captured with py-spy (the stack dump of the active thread):
Thread 188015 (active): "MainThread"
process (transformers/generation_beam_search.py:273)
beam_search (transformers/generation_utils.py:2285)
generate (transformers/generation_utils.py:1385)
decorate_context (torch/autograd/grad_mode.py:27)
infer_dataset (multi_translator.py:43)
_CallAndUpdateTrace (fire/core.py:681)
_Fire (fire/core.py:466)
Fire (fire/core.py:141)
main (multi_translator.py:58)
<module> (multi_translator.py:62)
_run_code (runpy.py:87)
_run_module_as_main (runpy.py:197)
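From skimming the linked file, process appears to iterate over the batch items and beam candidates in plain Python, along these lines (an illustrative, self-contained sketch with made-up shapes, not the actual source):

import torch

batch_size, num_beams = 64, 4
# Illustrative candidate tensors with the shapes beam search produces
# (2 * num_beams candidates per batch item).
next_tokens = torch.randint(0, 32000, (batch_size, 2 * num_beams))
next_token_scores = torch.randn(batch_size, 2 * num_beams)
next_indices = torch.randint(0, num_beams, (batch_size, 2 * num_beams))

# Nested Python for-loops: this bookkeeping runs on a single CPU core
# no matter which device the tensors live on.
for batch_idx in range(batch_size):
    for next_token, next_score, next_index in zip(
            next_tokens[batch_idx], next_token_scores[batch_idx], next_indices[batch_idx]):
        pass  # per-candidate hypothesis bookkeeping, one item at a time

If that reading is right, it would at least explain why a single core is pegged.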
What seems strange is that I thought beam_search ran on the GPU to be fast (we can see the device=device in the code). I don't know why the CPU is used here, nor how to make it use either the GPU or all of the available CPUs.
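For what it's worth, a quick sanity check (reusing the names from the repro below; nothing here is specific to my pipeline) confirms that the model weights and batch tensors can live on the GPU:

import torch

print(torch.cuda.is_available())                # expected: True
print(next(model.parameters()).device)          # expected: cuda:0
batch = tokenizer("hello world", return_tensors="pt")
print(batch["input_ids"].to(device).device)     # expected: cuda:0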
Here is a small reproducible example:
import pandas as pd
import torch
import tqdm
from torch.utils.data import DataLoader, Dataset
from transformers import MarianMTModel, MarianTokenizer


class TranslationDataset(Dataset):
    def __init__(self, dataset_path, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        self.dataset = pd.read_csv(dataset_path)["text"].values

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int):
        # Tokenize one sample; truncation=True added so inputs longer
        # than max_length do not slip through un-truncated.
        x = self.tokenizer(self.dataset[idx], return_tensors="pt",
                           max_length=512, truncation=True, padding="max_length")
        return x["input_ids"][0], x["attention_mask"][0]


device = 0
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-mul-en")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-mul-en").to(device)
dataset = TranslationDataset("data.csv", tokenizer)

# num_workers=16 parallelizes only __getitem__ (the tokenization);
# generate() itself still runs in the main process.
for input_ids, attention_mask in tqdm.tqdm(DataLoader(dataset,
                                                      batch_size=64,
                                                      num_workers=16,
                                                      pin_memory=True)):
    tokenized_outputs = model.generate(input_ids=input_ids.to(device),
                                       attention_mask=attention_mask.to(device),
                                       max_length=512)
Could you help me figure out why beam_search is the bottleneck of the pipeline, and how to make it run on multiple CPUs or on the GPU, please?
Relevant info (a sketch of how these flags map onto the generate call follows the list):
- num_beams > 1
- num_beam_groups = 1
- do_sample = False
- is_constraint_gen_mode = False
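To be concrete, this is roughly what those flags correspond to in my generate() call (num_beams=4 is an assumed value here; the real one comes from the model's generation config):

tokenized_outputs = model.generate(input_ids=input_ids.to(device),
                                   attention_mask=attention_mask.to(device),
                                   max_length=512,
                                   num_beams=4,        # > 1, so the beam_search branch is taken
                                   num_beam_groups=1,  # no diverse (group) beam search
                                   do_sample=False)    # deterministic decoding
# No constraints / force_words_ids are passed, so constrained
# generation (is_constraint_gen_mode) is off.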
Thanks in advance