I’m trying to run LLMs, for instance Minotaur-13B-Landmark, on my CPU with multiple threads. My GPU setup (an RTX 3060 plus an RTX 4070, 24GB of VRAM combined) has enough memory to load the model, but not enough to run a query.
I can load the model on the CPU using AutoModelForCausalLM, but when I run a query it only uses a single thread. My i9-10900X has 10 cores/20 threads, so I would expect a speedup from multiple threads.
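For reference, this is roughly what I’m running (a minimal sketch; the model path and prompt are placeholders, and torch.set_num_threads is the only threading knob I know of):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# PyTorch's intra-op thread pool; I'd expect this (or OMP_NUM_THREADS)
# to control how many cores generate() uses.
torch.set_num_threads(20)
print(torch.get_num_threads())  # reports 20

model_id = "path/to/Minotaur-13B-Landmark"  # placeholder: local path or hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # runs on CPU by default

inputs = tokenizer("Hello, my name is", return_tensors="pt")
# This is the step where only a single core is busy.
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```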
Am I missing a setting somewhere that is limiting me to a single thread? This is a Linux Fedora 36 system. I’d also like to load the model in 16-bit, which I think the i9-10900X (Cascade Lake) supports.
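For the 16-bit part, this is what I was planning to try (my assumption being that bfloat16 is the better-supported 16-bit dtype for CPU inference in PyTorch; please correct me if float16 is the right choice here):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights directly in a 16-bit dtype instead of the default float32.
# bfloat16 tends to be safer than float16 on CPU, though I'm not sure whether
# Cascade Lake has native bf16 instructions or would fall back to emulation.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/Minotaur-13B-Landmark",  # placeholder
    torch_dtype=torch.bfloat16,
)
```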
I also tried installing OpenVINO. When I load the model using OVModelForCausalLM, it runs an ONNX conversion step, and somewhere in that step it exhausts memory (48GB RAM + 50GB swap) and the Linux OOM killer terminates it, even with a smaller model like galactica-6.7B.
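For completeness, this is roughly the call that gets killed (assuming the optimum-intel API; the galactica id is the smaller model I tested with):

```python
from optimum.intel import OVModelForCausalLM

# export=True triggers the transformers -> ONNX -> OpenVINO IR conversion;
# it's during this conversion that memory usage blows past 48GB RAM + 50GB swap.
model = OVModelForCausalLM.from_pretrained(
    "facebook/galactica-6.7b",  # same failure as with the 13B model
    export=True,
)
```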