Hugging Face Forums
Offloading LLM models to CPU uses only single core
🤗Transformers
rhwauiy89
June 3, 2024, 6:13am
I’m also having the same problem with Mistral 7B; you can try BetterTransformer.
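The thread is about CPU offload using only a single core. Separate from the BetterTransformer suggestion, one common cause is PyTorch's intra-op thread count being pinned to 1 (e.g. via `OMP_NUM_THREADS`). A minimal sketch of checking and raising it, assuming only that PyTorch is installed:

```python
import os
import torch

# PyTorch's intra-op parallelism controls how many CPU cores ops like
# matmul use. If offloaded layers run on one core, the thread count may
# have been pinned to 1 (e.g. by OMP_NUM_THREADS in the environment).
torch.set_num_threads(os.cpu_count() or 1)

# Verify the setting took effect before running inference.
print(torch.get_num_threads())
```

This only affects CPU-side computation; it must be set before heavy inference starts, and whether it helps depends on how the offloaded layers are executed.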