I’m trying to run LLMs, for instance Minotaur-13B-Landmark, on my CPU with multiple threads. My GPU setup (an RTX 3060 plus an RTX 4070, 24GB of VRAM combined) has enough memory to load the model, but not enough to run a query.
I can load the model on the CPU using AutoModelForCausalLM, but when I run a query it only uses a single thread. My i9-10900X has 10 cores/20 threads, so I would expect a speedup from multiple threads.
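For reference, this is roughly what I’m running (a minimal sketch; the model path and prompt are placeholders, and torch.set_num_threads is the only threading knob I know of):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# PyTorch's intra-op thread pool; I'd expect this (or OMP_NUM_THREADS)
# to control how many cores generate() uses.
torch.set_num_threads(20)
print(torch.get_num_threads())  # reports 20

model_id = "path/to/Minotaur-13B-Landmark"  # placeholder: local path or hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # runs on CPU by default

inputs = tokenizer("Hello, my name is", return_tensors="pt")
# This is the step where only a single core is busy.
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```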
Am I missing a setting somewhere that is limiting me to a single thread? This is a Linux Fedora 36 system. I’d also like to load the model in 16-bit, which I think the i9-10900X (Cascade Lake) supports.
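For the 16-bit part, this is what I was planning to try (my assumption being that bfloat16 is the better-supported 16-bit dtype for CPU inference in PyTorch; please correct me if float16 is the right choice here):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights directly in a 16-bit dtype instead of the default float32.
# bfloat16 tends to be safer than float16 on CPU, though I'm not sure whether
# Cascade Lake has native bf16 instructions or would fall back to emulation.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/Minotaur-13B-Landmark",  # placeholder
    torch_dtype=torch.bfloat16,
)
```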
I also tried installing OpenVINO. When I load the model using OVModelForCausalLM, it runs an ONNX conversion step, and somewhere in that step it exhausts memory (48GB RAM + 50GB swap) and the Linux OOM killer terminates it, even with a smaller model like galactica-6.7B.
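For completeness, this is roughly the call that gets killed (assuming the optimum-intel API; the galactica id is the smaller model I tested with):

```python
from optimum.intel import OVModelForCausalLM

# export=True triggers the transformers -> ONNX -> OpenVINO IR conversion;
# it's during this conversion that memory usage blows past 48GB RAM + 50GB swap.
model = OVModelForCausalLM.from_pretrained(
    "facebook/galactica-6.7b",  # same failure as with the 13B model
    export=True,
)
```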