SSH connection with the remote server crashes when using device_map="auto"

ultracheese · July 10, 2024, 9:19am

I am running a Mistral model on a remote server over SSH. When I try to generate output conditioned on input embeddings, the connection closes unexpectedly. Interestingly, this happens only when I use Accelerate to load the model onto multiple GPUs.

I load the Mistral model as follows:

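import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets Accelerate distribute the weights across the available devices
model = AutoModelForCausalLM.from_pretrained(
    "AIRI-Institute/OmniFusion",
    subfolder="OmniMistral-v1_1/tuned-model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)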

Then I run the generation: model.generate(inputs_embeds=embeddings, max_new_tokens=50).
At this point the notebook freezes (I set the logging level to debug, but the call did not output anything), and I get disconnected from the remote server. I also checked that the model is indeed loaded onto multiple GPUs and not onto the CPU; there are no errors at that stage.
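
For reference, one way to do the placement check described above is to look at the `hf_device_map` attribute that Transformers sets on a model loaded with a `device_map`. The following is only a minimal sketch: the helper name `summarize_device_map` and the contents of `example_map` are hypothetical, made up for illustration, and are not taken from the actual OmniFusion model.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules are placed on each device."""
    counts = Counter()
    for module_name, device in device_map.items():
        # GPUs are usually recorded as integer indices; "cpu" and "disk" are strings.
        label = f"cuda:{device}" if isinstance(device, int) else str(device)
        counts[label] += 1
    return dict(counts)


# Hypothetical example map; with a real model, pass model.hf_device_map instead.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": 1,
}
summary = summarize_device_map(example_map)
print(summary)
```

If the summary contains a "cpu" or "disk" entry, part of the model has been offloaded from the GPUs.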

However, if I load the model onto a single GPU:

# Same call, but with the whole model pinned to the first GPU
model = AutoModelForCausalLM.from_pretrained(
    "AIRI-Institute/OmniFusion",
    subfolder="OmniMistral-v1_1/tuned-model",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)

With this setup the problem disappears: model.generate(inputs_embeds=embeddings, max_new_tokens=50) works as expected and finishes in less than 2 seconds.

What could cause such behaviour?
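
Since debug logging printed nothing during the freeze, one possible way to see where the call is stuck is Python's standard `faulthandler` module, which can dump the tracebacks of all threads to a file after a timeout. This is only a diagnostic sketch and says nothing about the cause: the helper name `run_with_watchdog`, the log file name, and the timeout are assumptions, and a trivial function stands in for the real `model.generate` call.

```python
import faulthandler
import os
import tempfile


def run_with_watchdog(fn, log_path, timeout_s=60.0):
    """Call fn(); if it is still running after timeout_s, dump all thread tracebacks to log_path."""
    with open(log_path, "w") as log_file:
        # Schedule the dump; exit=False means the process keeps running after the dump.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
        try:
            return fn()
        finally:
            # Cancel the pending dump once the call has returned or raised.
            faulthandler.cancel_dump_traceback_later()


# Stand-in workload; in the real case this would wrap the model.generate(...) call.
log_path = os.path.join(tempfile.gettempdir(), "generate_hang.log")
result = run_with_watchdog(lambda: sum(range(10)), log_path, timeout_s=5.0)
print(result)
```

Writing to a file on disk rather than to the notebook output means the traceback can still be read after reconnecting, provided the process lives long enough to write it.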

I am not sure whether this should be opened as an issue, so I am starting by asking it as a discussion topic. Thanks in advance for your answers!
