Hi,
So I need to load multiple large models in a single script and control which GPUs each one is kept on. For example, let's say I want to load one LLM on the first 4 GPUs and another LLM on the last 4 GPUs. If I pass "auto" to device_map, it always uses all GPUs. I can't use CUDA_VISIBLE_DEVICES since I need all of the GPUs to be visible in the script.
For example, what would be the correct argument to device_map to load Llama 3.1 on GPUs 0, 1, 2, 3?
import torch
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-70B-Instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map=llm_device,  # <- what should this be to pin the model to GPUs 0-3?
    token=ACCESS_TOKEN,
)
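For reference, the closest workaround I've found is to keep device_map="auto" but restrict it with a max_memory dict that only lists GPUs 0-3 (the second model would then only list GPUs 4-7). Below is a rough sketch, assuming max_memory forwarded through model_kwargs behaves the same as when calling from_pretrained directly; the "75GiB" values are just placeholders for whatever my cards actually have.

import torch
import transformers

# Possible workaround (not sure it's the intended approach): only list
# GPUs 0-3 in max_memory so "auto" can only shard the weights across
# those devices. Memory sizes here are placeholders.
max_memory = {i: "75GiB" for i in range(4)}

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-70B-Instruct",
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "max_memory": max_memory,
    },
    device_map="auto",
    token=ACCESS_TOKEN,  # same token as in the snippet above
)

Is something like this the recommended way, or is there a cleaner device_map argument for pinning a model to a specific set of GPUs?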