My code looks like this:
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("/path/to/model")
model = LlamaForCausalLM.from_pretrained("/path/to/model")

prompt = "prompt text"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=1500, temperature=0.7, do_sample=True)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
For about 15 seconds it uses 50% CPU, then it drops to 15% CPU until it's done generating.
I would like it to use 100% CPU, since that should be roughly 6x faster.
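For context on what those percentages mean, here is a quick standard-library check of how many logical CPUs the machine reports (a sketch; "100% CPU" in most process monitors means all of these are busy, so 15% is only a couple of cores' worth of work):

import os

# Logical CPU count as reported by the OS; process monitors usually
# treat "100%" as all of these cores being fully busy.
logical_cpus = os.cpu_count()
print(f"logical CPUs: {logical_cpus}")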
I tried googling this problem, but all I could find were people trying to use the CPU instead of the GPU, or people trying to run on a specific number of CPU cores/threads.
In case it's relevant:
I'm running Linux Mint 20. This machine has nothing installed except transformers and JupyterLab; I installed transformers in a venv, and I'm using PyTorch.