Using multiple CPU threads to run an LLM model

I’m trying to run LLM models, for instance Minotaur-13B-Landmark, on my CPU with multiple threads, since my GPU setup (an RTX 3060 and an RTX 4070, 24 GB of VRAM combined) has enough memory to load the model but not to run a query.

I can load the model on the CPU using AutoModelForCausalLM, but when I run a query it only uses a single thread. My i9-10900X has 10 cores/20 threads, so I would expect a speedup from multiple threads.

Am I missing a setting somewhere that is limiting me to a single thread? This is a Linux Fedora 36 system. I’d also like to load the model in 16-bit, which I think the i9-10900X (Cascade Lake) supports.
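From reading the docs, I believe the relevant knobs are `torch.set_num_threads()` for the CPU thread pool and `torch_dtype` for the 16-bit load. A minimal sketch of what I think should work (the model id is just a placeholder, not the actual checkpoint name):

```python
import torch

# Tell PyTorch's intra-op thread pool to use all 20 logical cores
# of the i9-10900X.
torch.set_num_threads(20)
print(torch.get_num_threads())  # should now report 20

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder model id -- substitute the real Minotaur checkpoint.
    model_id = "path/to/minotaur-13b-landmark"
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # 16-bit load; bf16 is usually the
                                      # safer 16-bit choice on CPU
        low_cpu_mem_usage=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
```

I haven't confirmed whether `set_num_threads` alone is enough, or whether an environment variable like `OMP_NUM_THREADS` is overriding it.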

I tried installing OpenVINO. When I load the model using OVModelForCausalLM, it runs an ONNX conversion step, and somewhere in that step it runs out of memory (48 GB RAM + 50 GB swap) and Linux kills the process, even with a smaller model like galactica-6.7B.

I gave up on OpenVINO; I’m not sure it would have worked anyway without a huge amount of memory. I did figure out how to load a LlamaCpp model, which, at least in my test with a 13B-parameter model, loads quickly, runs multiple threads, is reasonably fast, and has reasonable memory usage.
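In case it helps anyone else, this is roughly the llama-cpp-python pattern I ended up with; the model path is a placeholder for your own quantized file, and the key parameter is `n_threads`:

```python
import os

def pick_threads() -> int:
    # llama.cpp tends to scale with physical cores rather than logical
    # ones, so use half the logical count (10 on an i9-10900X).
    return max(1, (os.cpu_count() or 2) // 2)

if __name__ == "__main__":
    from llama_cpp import Llama  # pip install llama-cpp-python

    # Placeholder path -- point this at your quantized model file.
    llm = Llama(
        model_path="path/to/minotaur-13b.q4_0.bin",
        n_threads=pick_threads(),
        n_ctx=2048,
    )
    out = llm("Hello,", max_tokens=16)
    print(out["choices"][0]["text"])
```

Worth experimenting with `n_threads` between the physical and logical core counts, since the sweet spot seems to depend on the machine.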