How to make model.generate() use multiple CPU cores?

Hi, I am trying to run text generation with PyTorch and Transformers for Janus on a host with no GPU. It takes forever, and a quick look at the process stats shows the Python3 process using only one CPU core and about 15 GiB of memory. I am not sure about this, but I wonder whether it could be faster with more cores, say 20.

I found model.generate single CPU core bottleneck · Issue #24524 · huggingface/transformers, which says this is the expected behavior when a GPU is present.

Is it possible to have it benefit from multiple CPU cores when there is no GPU?

Thanks.

PS:
The environment is Python 3.9.21 with Transformers 4.48.3. The behavior can be reproduced with the Janus demo:

import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).eval()

# placeholder inputs so the snippet runs standalone (substitute your own)
question = "Describe this image."
image = "./your_image.png"

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(  ### this call uses only one CPU core
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

Transformers is tuned for GPU and multi-GPU setups and is not optimized for CPUs. Furthermore, Python itself is not well suited to multi-threading (because of the GIL) or to multi-processing.
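
That said, PyTorch's heavy CPU kernels (matrix multiplication, attention) run in native threads outside the GIL, so before switching runtimes it is worth checking whether the intra-op thread pool has been capped at one thread, for example by an OMP_NUM_THREADS=1 setting in the environment. A minimal sketch of that check, assuming the model stays in plain PyTorch (the 20 below is just the core count from the question):

import os
import torch

# Show the thread limits PyTorch will currently use for CPU kernels.
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())

# If the intra-op count is stuck at 1, raise it to the number of
# physical cores before calling generate().
torch.set_num_threads(20)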

Beyond that, since there is a lot of demand for inference on CPUs, there are various libraries for speeding it up, such as ONNX Runtime (via Optimum) or llama.cpp-style runtimes. They are a little harder to use, but I think it would be a good idea to try them out.
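
As one illustration of the Optimum + ONNX Runtime route, this is roughly what CPU inference looks like for a plain text-only causal LM. Janus-Pro's multimodal wrapper (loaded with trust_remote_code) most likely will not export this way, so the model name below is only a placeholder, not a drop-in replacement for the snippet above:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # placeholder text-only model, not Janus-Pro
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch weights to ONNX so that ONNX Runtime's
# multi-threaded CPU execution provider can run them.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))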

