How to make model.generate() use multiple CPU cores?

Hi, I am trying to run text generation with PyTorch and Transformers for Janus on a host with no GPU. It takes forever, and a quick look at the process stats shows the Python3 process using only one CPU core and about 15 GiB of memory. I am not sure about this, but I wonder whether it could be faster with more cores, say 20.

I found model.generate single CPU core bottleneck · Issue #24524 · huggingface/transformers, which says this is the expected behavior when a GPU is present.

Is it possible to have it benefit from multiple CPU cores when there is no GPU?

Thanks.

PS:
The environment is Python 3.9.21 with Transformers 4.48.3. The behavior can be reproduced with the Janus demo:

import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).eval()

# placeholder inputs so the snippet runs standalone (substitute your own)
question = "Describe this image."
image = "./your_image.png"

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(  ### this call uses only one CPU core
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

Transformers is tuned for GPU and multi-GPU setups and is not optimized for CPUs. Furthermore, Python itself is not well suited to multi-threading (because of the GIL) or to multi-processing.
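
That said, PyTorch's heavy CPU kernels (matrix multiplication, attention) run in native threads outside the GIL, so before switching runtimes it is worth checking whether the intra-op thread pool has been capped at one thread, for example by an OMP_NUM_THREADS=1 setting in the environment. A minimal sketch of that check, assuming the model stays in plain PyTorch (the 20 below is just the core count from the question):

import os
import torch

# Show the thread limits PyTorch will currently use for CPU kernels.
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())

# If the intra-op count is stuck at 1, raise it to the number of
# physical cores before calling generate().
torch.set_num_threads(20)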

Beyond that, since there is a lot of demand for inference on CPUs, there are various libraries for speeding it up, such as ONNX Runtime (via Optimum) or llama.cpp-style runtimes. They are a little harder to use, but I think it would be a good idea to try them out.
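
As one illustration of the Optimum + ONNX Runtime route, this is roughly what CPU inference looks like for a plain text-only causal LM. Janus-Pro's multimodal wrapper (loaded with trust_remote_code) most likely will not export this way, so the model name below is only a placeholder, not a drop-in replacement for the snippet above:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # placeholder text-only model, not Janus-Pro
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch weights to ONNX so that ONNX Runtime's
# multi-threaded CPU execution provider can run them.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))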

