How to generate with a single GPU when a model is loaded onto multiple GPUs?


I am currently using the Llama 2 7B Chat model. I am trying to run inference on inputs with a very high token count, so my thought was to distribute the model across multiple GPUs and run generation on only one of them.

Having read the documentation on handling big models, I tried doing this with AutoModelForCausalLM.from_pretrained(model_id, device_map='balanced_low_0').

I tokenize using
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokens = tokenizer.encode(prompt, return_tensors='pt').to('cuda:0')
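For reference, the full snippet I'm running looks roughly like this (model_id and prompt are placeholders; I can't share the exact values here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder

# Shard the model across GPUs; 'balanced_low_0' keeps GPU 0 mostly free,
# which I hoped would leave room for generation there
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="balanced_low_0",
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "..."  # a very long prompt in practice
# return_tensors="pt" gives a tensor that can be moved to a device
tokens = tokenizer.encode(prompt, return_tensors="pt").to("cuda:0")

# This is the call that raises the RuntimeError below
output = model.generate(tokens, max_new_tokens=256)
```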

However, when I try model.generate(tokens, ...) I get the error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Can anyone help me run model.generate on only one GPU ('cuda:0') while storing the model on the rest of the GPUs?

Hardware: 2x RTX 6000 GPUs.