How to parallelize inference on a quantized model

John6666 · September 24, 2024, 11:08pm

The HF documentation consists of hand-written introductory guides and an API reference that is generated automatically from the library. (They extract what’s written as docstrings and comments in the code.)
The introductory guides often contain theoretical ideals, or information that was correct when it was written but is now outdated, so ultimately it is quicker to read the library code or to study and borrow from the work of others who have done it well.
It would be easiest if it could be fixed by updating the library…

But when it comes to multi-GPU setups, few people use them on HF’s Spaces, so that path probably gets less testing. If it’s buggy, you’ll have to do the device placement manually with torch.
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
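
To give a rough idea of what "doing it manually with torch" means, here is a minimal sketch in the spirit of that tutorial: the model is split into two stages, each stage is placed on its own device, and the activations are moved between devices inside forward(). The names TwoStageModel and pick_devices and the layer sizes are made up for illustration. It is a plain float toy model, not a quantized one; whether a quantized checkpoint can be moved around with .to() like this depends on the quantization backend, so treat that part as an open assumption. It falls back to CPU when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn


def pick_devices():
    """Return two devices: two GPUs if available, otherwise CPU for both."""
    if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")


class TwoStageModel(nn.Module):
    """Hypothetical toy model: first half lives on dev0, second half on dev1."""

    def __init__(self, dev0, dev1, in_dim=16, hidden=32, out_dim=4):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        # Each stage is created and then moved to its own device.
        self.stage0 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()).to(dev0)
        self.stage1 = nn.Linear(hidden, out_dim).to(dev1)

    def forward(self, x):
        # Move the input to the first device, run stage 0 there,
        # then move the activations to the second device for stage 1.
        h = self.stage0(x.to(self.dev0))
        return self.stage1(h.to(self.dev1))


dev0, dev1 = pick_devices()
model = TwoStageModel(dev0, dev1).eval()

# Inference only, so no gradients are needed.
with torch.no_grad():
    out = model(torch.randn(8, 16))
```

Note that this kind of split only spreads the memory across GPUs. At any moment just one device is working, so by itself it does not increase tokens per second.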
