Offloading LLM models to the CPU uses only a single core

I am using oobabooga/text-generation-webui (a Gradio web UI for running large language models) to run large models like OPT-30B and the new LLaMA. This project uses Transformers and the accelerate library to offload whatever doesn't fit on the GPU to the CPU.

Unfortunately, on my 8-core CPU only a single core is utilized during inference. That core is pegged at 100% and bounced around by the scheduler while the other cores stay idle.

Other approaches like FlexGen utilize the entire CPU, and inference appears to be much faster.

The loading code is very simple:

        model = AutoModelForCausalLM.from_pretrained(Path(f"models/{shared.model_name}"), device_map='auto')

Is this related to the Python GIL? Is there some other parameter I should pass?
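
For reference, this is the kind of knob I mean (PyTorch's intra-op thread pool). Just a sketch; I haven't verified that accelerate's CPU offload actually respects it:

    # Sketch only: check/raise PyTorch's intra-op thread count for CPU tensor ops.
    # Whether this helps with accelerate's offloading is exactly my question.
    import os
    import torch

    print(torch.get_num_threads())         # threads PyTorch currently uses for intra-op work
    torch.set_num_threads(os.cpu_count())  # e.g. 8 here; OMP_NUM_THREADS also influences this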


I'm also having the same problem with Mistral 7B. You could try BetterTransformer.
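
Something along these lines, using the optimum library (rough sketch; the model path is just a placeholder, and I'm assuming your architecture is supported by BetterTransformer):

    # Rough sketch: convert the loaded model with BetterTransformer from optimum
    # (pip install optimum). The model path below is only a placeholder.
    from optimum.bettertransformer import BetterTransformer
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "models/your-model",  # placeholder; use your own checkpoint or path
        device_map="auto",
    )
    model = BetterTransformer.transform(model)  # swaps supported layers for fastpath kernels

If I remember correctly, newer transformers versions also expose model.to_bettertransformer(), which does the same conversion.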