Cannot run large models using API token

Hi @radames

Many thanks for letting me know. I have two follow-up questions:

  1. Is there a way to load one of your large models locally for inference with model parallelism, without having to manually edit the model’s internal code? I have access to multiple GPUs. A single 24GB GPU cannot hold a 20-billion-parameter model in memory, but spreading it across several GPUs should make it possible; however, when I load the model it only tries to place everything on a single GPU and then raises an out-of-memory error. (A rough sketch of my current setup is below, after this list.)

  2. Why are the results given by the UI when I run a model on the hub non-deterministic, while downloading the same model and running inference locally appears to be deterministic and gives a different output from the one I get on the hub? I am asking mostly about generative models.
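
For question 1, this is roughly how I am loading the model at the moment (the model name is just a placeholder for the actual 20B checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "org/some-20b-model"  # hypothetical placeholder for the actual hub checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Loading in fp16 to halve the memory footprint, but ~20B parameters still need
# roughly 40GB of weights, which does not fit on a single 24GB card.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.to("cuda:0")  # everything is placed on one GPU -> CUDA out of memory
```
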
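For question 2, this is roughly how I run inference locally once a model does fit in memory (reusing the `model` and `tokenizer` names from the sketch above; the prompt is just an example). I pass no sampling arguments, so as far as I understand `generate()` falls back to its defaults, which would explain why my local output never changes between runs:

```python
inputs = tokenizer("Once upon a time", return_tensors="pt")  # example prompt
outputs = model.generate(**inputs, max_new_tokens=50)        # no sampling flags passed
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
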

Thanks a lot
