Why doesn't transformers use multiple GPUs (to increase tokens per second)?

This is my current code to load the Llama 3.1 8B Instruct model on a local Windows 10 PC. I have tried many methods to get it to run on multiple GPUs (in order to increase tokens per second), but without success.

The model loads onto GPU:0 while GPU:1 stays idle, and generation averages about 12-13 tokens per second.

If I use device_map="auto", it spreads the model across both GPUs but also onto the CPU, and then the tokens per second drop to roughly 5.
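For reference, the device_map="auto" variant mentioned above looks roughly like this (a sketch; the commented max_memory line is just an idea for keeping the weights off the CPU, with placeholder values I have not verified on this machine):

# sketch of the device_map="auto" variant that ends up spreading the model
# across both GPUs *and* the CPU on my machine
model = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    # max_memory={0: "22GiB", 1: "22GiB"},  # placeholder idea: cap per-GPU budgets so accelerate only maps to the GPUs
)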

import transformers
import torch
import bitsandbytes as bnb


try:
    tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
    tokenizer.pad_token_id = tokenizer.eos_token_id
    model = transformers.AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3.1-8B-Instruct",
        torch_dtype=torch.bfloat16,
        load_in_8bit=False,
        attn_implementation="flash_attention_2",
        device_map="cuda",
    )
    
    # Wrap the model using PyTorch's DataParallel for multi-GPU usage if more than one GPU is available
    # This will allow your model to split the input data across your GPUs, improving performance for large models.
    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)
        print(f"Using {torch.cuda.device_count()} GPUs for inference.")
    else:
        model = model
        print("Using single GPU for inference.")
                    
    # CUDA optimization settings
    torch.backends.cuda.cufft_plan_cache.clear()  # Clear CUFFT plan cache to avoid memory issues
    torch.backends.cuda.matmul.allow_tf32 = True  # Enable TensorFloat-32 to improve speed on RTX 3090
    torch.backends.cudnn.benchmark = True  # Enable cuDNN benchmark mode to improve speed
    torch.backends.cudnn.deterministic = False  # Disable cuDNN deterministic mode to improve speed

    message_history = [{"role": "user", "content": "hello"}]

    messages = tokenizer.apply_chat_template(message_history, tokenize=False, add_generation_prompt=True, return_tensors="pt")
    model_inputs = tokenizer([messages], truncation=True, padding=True, return_tensors="pt").to('cuda')
    model_inputs['attention_mask'] = (model_inputs['input_ids'] != tokenizer.pad_token_id).long().to('cuda')
    input_ids = model_inputs['input_ids'].to('cuda')
    attention_mask = model_inputs['attention_mask'].to('cuda')

    with torch.no_grad():
        # If the model is wrapped with DataParallel, use model.module
        if isinstance(model, torch.nn.DataParallel):
            model_to_use = model.module
        else:
            model_to_use = model

        response_tensor = model_to_use.generate(
            input_ids = input_ids,
            attention_mask = attention_mask,
            pad_token_id = tokenizer.eos_token_id,
            max_new_tokens = 2048,
            do_sample = True,
            top_k = 150,
            top_p = 0.95,
            temperature = 0.75,
            num_beams = 1,
        )

        # Decode the generated response
        response = tokenizer.decode(response_tensor[0], skip_special_tokens=True)  # note: skip_prompt is a TextStreamer option, not a decode() option

        print(f"Response: {response}")
except Exception as err:
    print(f"Error occurred while generating response: {err}")


Any ideas would be appreciated; a code sample would be even better.

10x

Possibly: (a link to a related thread about splitting a model that doesn't fit into a single GPU's memory across multiple GPUs with Accelerate)


@John6666 thanks for your answer. The suggestion you provided doesn't seem to be exactly what I need: it was about very large models that can't fit into a single GPU's memory and wanted to use the leftover capacity of a third, under-utilized GPU (more of a sharding problem). It also suggested using Accelerate, which automates and abstracts the process; that is good on its own, but I want to try the lower-level interfaces first in order to gain a deeper understanding before stacking more frameworks.

After reading more about it, it looks like torch.nn.parallel.DistributedDataParallel (DDP) is a better way to go than torch.nn.DataParallel (DP).

Also, I need to create a server that will serve more than a single user, so the generation work can be distributed between multiple GPUs (batch_size should be > 1); a rough sketch of that direction follows below.
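For illustration, here is a minimal sketch of that replica-per-GPU direction, assuming a torchrun launch with one process per GPU; the prompts and the launch command are placeholders, not a working server:

# assumed launch: torchrun --nproc_per_node=2 ddp_generate.py
import os
import torch
import torch.distributed as dist
import transformers

def main():
    # one process per GPU; each rank holds a full replica of the model
    # (for pure inference there is no gradient sync, so the DistributedDataParallel
    #  wrapper itself is not strictly needed; each rank just serves its own batch)
    dist.init_process_group(backend="gloo" if os.name == "nt" else "nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
    tokenizer.pad_token_id = tokenizer.eos_token_id
    model = transformers.AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3.1-8B-Instruct",
        torch_dtype=torch.bfloat16,
        device_map={"": rank},  # the whole model lives on this rank's GPU
    )

    # placeholder: a real server would pull a different batch of user requests per rank
    prompts = [f"hello from rank {rank}"]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(f"cuda:{rank}")

    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
    print(rank, tokenizer.decode(out[0], skip_special_tokens=True))

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Throughput should scale because each GPU generates for a different batch of requests; single-request latency stays the same as on one GPU.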

I see, so you’re not talking about bugs or anything.
You may have already tried this, but how about using "cuda:0" or "cuda:1" or an integer (.to(device=0)) instead of "cuda"?
With plain "cuda", I think "cuda:0" / int(0) (the first GPU) is implicitly used.
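For example, something along these lines (just a sketch; device_map also accepts an explicit device):

model = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda:1",  # or device_map={"": 1}; pins the whole model to the second GPU
)
inputs = tokenizer("hello", return_tensors="pt").to("cuda:1")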

That is a good idea, but it means manually managing the distribution.

I think it would be a reinvention of accelerate, but if you’re going to do it manually, why not do it like this?

def get_idle_gpu():
    import torch
    import random
    gpu_num = torch.cuda.device_count()
    # placeholder: picks a random valid GPU index; in practice you would return one that is relatively idle
    device_num = random.randint(0, gpu_num - 1)  # randint is inclusive on both ends
    return device_num

~~.to(device=get_idle_gpu())

Can you please explain why you think random would produce the number of an idle GPU? It might just as well return the index of a busy, heavily loaded GPU; after all, it is random.

And I did not mean that I want to manage the distribution of the model and the inference calculations over the GPUs manually.

I meant that if I were to use your direct GPU address (cuda:0) while loading the model, I would in fact be doing a manual allocation/deployment/distribution, and I'd prefer not to. I'd rather understand how to use PyTorch to distribute the workload and the model in an optimized manner, and how to load-test and monitor it so I can prove to myself that the code is working.

Apologies. The random function is just a sample to fill in the blanks and has no deeper meaning whatsoever. :cold_sweat:
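If you wanted the helper to actually pick a relatively idle GPU instead of a random one, one option is to choose the device with the most free memory; a sketch using torch.cuda.mem_get_info (free memory is only a rough proxy for "idle"):

import torch

def get_idle_gpu():
    # pick the GPU with the most free memory as a rough proxy for "idle"
    best_device, best_free = 0, -1
    for i in range(torch.cuda.device_count()):
        free, _total = torch.cuda.mem_get_info(i)  # returns (free_bytes, total_bytes)
        if free > best_free:
            best_device, best_free = i, free
    return best_device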

I probably don't understand properly what you want to do, but if you just want some automatic and fast distributed processing, without having to rely on accelerate, you could get by with the following code.

model = torch.nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))

Once the device_ids are explicitly assigned, it should be distributed on its own when you call .to("cuda").

Also, if the aim is practical use rather than distributed processing or exploration for its own sake, then 4-bit quantization would save some resources. It may not be suitable for every use case, but...
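For example, 4-bit loading via bitsandbytes might look roughly like this (a sketch; the quantization settings are common defaults, not tuned values):

import torch
import transformers

# 4-bit quantization config; nf4 with bfloat16 compute is a common default
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)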