How can I make model.generate() use multiple CPU cores?

Transformers is tuned for GPUs and multi-GPU setups, not for CPUs. Furthermore, Python itself is poorly suited to multi-threading because of the GIL, and multi-processing adds its own overhead.

However, since there is a lot of demand for CPU inference, there are various libraries for speeding it up (ONNX Runtime, OpenVINO, and llama.cpp, for example). They can be a little tricky to use, but I think they are worth trying out.
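That said, if you are staying with plain PyTorch, generate() already runs its tensor ops on an intra-op thread pool, and you can control its size. A minimal sketch (this assumes the PyTorch backend; `torch.set_num_threads` is the real PyTorch API, but the best thread count for your machine is something you'd need to benchmark):

```python
import os
import torch

# Size PyTorch's intra-op thread pool to the visible core count.
# generate() itself stays single-process; the matmuls inside it
# are what get parallelized across these threads.
n_cores = os.cpu_count() or 1
torch.set_num_threads(n_cores)

print(torch.get_num_threads())  # should now match n_cores

# After this, model.generate(...) on a CPU model will use the pool, e.g.:
# output_ids = model.generate(**inputs, max_new_tokens=50)
```

Note that more threads is not always faster: on small models, thread-synchronization overhead can dominate, so it is worth trying a few values (including the physical rather than logical core count).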
