Cannot run large models using API token

Hi @radames

Many thanks for letting me know. I have two follow-up questions:

  1. Is there a way to load one of your large models locally for inference with model parallelism, without having to manually edit the model’s internal code? I have access to multiple GPUs. A single 24GB GPU cannot hold a 20-billion-parameter model in memory, but spreading it across several GPUs should make it possible; however, when I load the model it only tries to place everything on a single GPU and then raises an out-of-memory error. (A rough sketch of my current setup is below, after this list.)

  2. Why are the results given by the UI when I run a model on the hub non-deterministic, while downloading the same model and running inference locally appears to be deterministic and gives a different output from the one I get on the hub? I am asking mostly about generative models.
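
For question 1, this is roughly how I am loading the model at the moment (the model name is just a placeholder for the actual 20B checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "org/some-20b-model"  # hypothetical placeholder for the actual hub checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Loading in fp16 to halve the memory footprint, but ~20B parameters still need
# roughly 40GB of weights, which does not fit on a single 24GB card.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.to("cuda:0")  # everything is placed on one GPU -> CUDA out of memory
```
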
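For question 2, this is roughly how I run inference locally once a model does fit in memory (reusing the `model` and `tokenizer` names from the sketch above; the prompt is just an example). I pass no sampling arguments, so as far as I understand `generate()` falls back to its defaults, which would explain why my local output never changes between runs:

```python
inputs = tokenizer("Once upon a time", return_tensors="pt")  # example prompt
outputs = model.generate(**inputs, max_new_tokens=50)        # no sampling flags passed
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
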

Thanks a lot
