My code looks like this:
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("/path/to/model")
model = LlamaForCausalLM.from_pretrained("/path/to/model")

prompt = "prompt text"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=1500, temperature=0.7, do_sample=True)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
For about 15 seconds it uses 50% CPU, then it drops to 15% CPU until it's done generating.
I would like it to use 100% CPU, since that should be roughly 6x faster.
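For context on what those percentages mean, here is a quick standard-library check of how many logical CPUs the machine reports (a sketch; "100% CPU" in most process monitors means all of these are busy, so 15% is only a couple of cores' worth of work):

import os

# Logical CPU count as reported by the OS; process monitors usually
# treat "100%" as all of these cores being fully busy.
logical_cpus = os.cpu_count()
print(f"logical CPUs: {logical_cpus}")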
I tried googling this problem, but all I could find were people trying to use the CPU instead of the GPU, or people trying to run on a specific number of CPU cores/threads.
In case it's relevant:
I'm running Linux Mint 20. This machine has nothing installed except transformers and JupyterLab; I installed transformers in a venv, and I'm using PyTorch.