When I deployed Mistral-Large-Instruct-2407 on a multi-GPU server, I set the GPU device map to "auto", but responses came back very slowly. I wanted to run my 8× A100 80 GB GPUs at full speed, but every attempt at tuning the multi-GPU settings (workers, threads, per-GPU memory limits, etc.) ended with GPU memory fully occupied. Sometimes, when it did run, I hit errors whenever child threads tried to use the model that the main thread had pre-loaded onto the GPUs. I implemented a "swap" solution, but it still tells me to use "swap".
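For the child-thread errors, one pattern I'm considering is having a single worker thread own the model and funneling all requests to it through a queue, so no two threads ever touch the CUDA context at once. Here's a minimal sketch of that pattern; `DummyModel` is a hypothetical stand-in for the real loaded model, just to show the structure:

```python
import queue
import threading

class DummyModel:
    """Hypothetical stand-in for the real GPU-loaded model."""
    def generate(self, prompt: str) -> str:
        return prompt.upper()

def serve(model, requests: queue.Queue, results: dict) -> None:
    # Only this thread ever calls the model, so other threads
    # never touch the GPU state directly.
    while True:
        item = requests.get()
        if item is None:  # sentinel: shut down the worker
            break
        idx, prompt = item
        results[idx] = model.generate(prompt)

requests = queue.Queue()
results = {}
model = DummyModel()  # in reality, the model pre-loaded by the main thread

worker = threading.Thread(target=serve, args=(model, requests, results))
worker.start()

# Any number of producer threads can enqueue work safely.
for i, prompt in enumerate(["hello", "world"]):
    requests.put((i, prompt))

requests.put(None)  # signal shutdown
worker.join()
print(results)  # {0: 'HELLO', 1: 'WORLD'}
```

I'm not sure this is the intended way to share one pre-loaded model across threads, which is why I'm asking.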
I haven't seen any official sample code on Hugging Face or GitHub. I'm seeking guidance from everyone.