Running Mistral-7B-Instruct-v0.2 on multiple GPUs

I’m trying to run a fairly straightforward script: I just want to experiment with running my own chat offline, using the Mistral-7B-Instruct-v0.2 model on my own hardware. My setup is relatively old (I used it to help some researchers back in the day): four GeForce GTX 1080 cards with 8 GB of VRAM each.

If my script looks overly complicated, it’s because I’ve been tweaking it a lot trying to get it to run. Here’s the script:

import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

def generate_text(input_text, num_texts=2, max_length=100, num_beams=5, early_stopping=True):
    # Set the GPUs to use
    device_ids = [0, 1, 2, 3]  # Modify this list according to your GPU configuration
    primary_device = f'cuda:{device_ids[0]}'  # Primary device
    torch.cuda.set_device(primary_device)

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2")
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model = AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2").to(primary_device)

    # Move model to GPUs
    model = torch.nn.DataParallel(model, device_ids=device_ids)

    # Tokenize the input text and move to the primary device
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, padding=True, truncation=True)
    inputs = {k: v.to(primary_device) for k, v in inputs.items()}

    # Generate multiple texts using different random seeds
    generated_texts = []
    for i in range(num_texts):
        # Set the random seed for reproducibility
        torch.manual_seed(i)

        # Generate the text using the model
        with torch.no_grad():
            outputs = model.module.generate(**inputs, max_length=max_length, num_beams=num_beams, early_stopping=early_stopping)

        # Decode and add the generated text to the list
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        generated_texts.append(generated_text)

    return generated_texts

if __name__ == "__main__":
    # Set the input text and style
    input_text = "Tell me a story about a dragon and a princess."

    # Generate texts
    generated_texts = generate_text(input_text)

    # Write the generated texts to a JSON file
    with open("generated_texts.json", "w") as f:
        json.dump(generated_texts, f)

This is my output:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.16it/s]
Traceback (most recent call last):
  File "myscript.py", line 44, in <module>
    generated_texts = generate_text(input_text)
  File "myscript.py", line 14, in generate_text
    model = AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2").to(primary_device)
  File "/home/user/Transformers/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2556, in to
    return super().to(*args, **kwargs)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 7.92 GiB of which 86.81 MiB is free. Including non-PyTorch memory, this process has 7.12 GiB memory in use. Of the allocated memory 7.02 GiB is allocated by PyTorch, and 1.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Basically, as the output above shows, I only manage to run it on one GPU. I can pick which GPU, but I can’t make it use the other three. Correct me if I’m wrong, but my understanding now is that I have to manually split the model across the 4 GPUs? In other words, does the model have to fit entirely in one GPU even if I want to use all 4?

Mistral says on their site that the model requires 16 GB. Each of my GPUs has 8 GB. I’ve tried searching about model parallelism, pipeline parallelism, and sharded data parallelism, but I don’t find much about this model in particular, and mostly these are concepts I don’t have experience with.

Do I need to turn SLI on? Back in the day you didn’t need it for training, but this is for inference.

This leaves me with another question: what about bigger models like Mixtral-8x7B-v0.1 that need around 100 GB? A100s only have 80 GB, so the entire model doesn’t fit in a single GPU even there.

I’m aware I could run it on some cloud infrastructure, but that kind of defeats the purpose of what I’m trying to do at the moment, and it means spending resources I don’t need to spend since I already have this setup. Why not use it?

I hope you can guide me on this. Thanks a lot.

I already tried specifying the visible GPUs in the script. I also made all 4 GPUs visible in my .bashrc (export CUDA_VISIBLE_DEVICES=0,1,2,3), but still only one GPU is used.

Hi @blutzauber!

The from_pretrained() function has a device_map parameter: with device_map="auto" it automatically distributes the model across your GPUs (you may need to install the accelerate package for this).
So you would only have to change the following:

AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2", device_map="auto")

I hope this helps you.


Worked like a dream! Thanks. I didn’t have to do anything else, just install an extra package.

Sadly, I can’t go beyond 5 beams and about 110 tokens, so the texts are fairly short. I may still want to upgrade the hardware. Also, every generated text is exactly the same, even though the seeds are different!
If you have any idea as to why I’ll be so grateful, so many thanks!

Great, I’m glad to hear that it worked out!

What kind of problem do you get when generating? Is it also an OutOfMemoryError?
If so, you could load the model quantized with:

AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2", device_map="auto", load_in_4bit=True)

You will probably need to install the bitsandbytes package for this.
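If you want a bit more control, a rough sketch with an explicit quantization config (assuming a reasonably recent transformers plus bitsandbytes) would be:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights take roughly 4 GB for a 7B model, which leaves
# much more room on 8 GB cards for longer generations.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "MistralAI/Mistral-7B-Instruct-v0.2",
    device_map="auto",
    quantization_config=bnb_config,
)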

Here is a great blog to learn more about quantization.

About the identical outputs: beam search with do_sample=False is deterministic, so setting a different seed has no effect. To obtain different outputs from the generation, you can enable sampling and set the following parameters:

model.generate(**inputs, do_sample=True, num_beams=4, top_k=40, temperature=1.0)

Depending on these parameters, generate() selects one of the following decoding strategies:
  • greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
  • multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
  • beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.
  • beam-search multinomial sampling by calling beam_sample() if num_beams>1 and do_sample=True.
  • diverse beam-search decoding by calling group_beam_search(), if num_beams>1 and num_beam_groups>1.
  • constrained beam-search decoding by calling constrained_beam_search(), if constraints!=None or force_words_ids!=None.

Here you can find more about the generation parameters.
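For example (just a sketch, reusing the model, tokenizer and inputs from your script), your generation loop could become something like this; note that without DataParallel you call model.generate() directly, not model.module.generate():

generated_texts = []
for i in range(num_texts):
    torch.manual_seed(i)  # with do_sample=True the seed actually changes the output
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,  # limits only the newly generated tokens
            do_sample=True,      # sample instead of deterministic beam search
            top_k=40,
            temperature=0.8,
        )
    generated_texts.append(tokenizer.decode(outputs[0], skip_special_tokens=True))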

I hope this helps you. 🙂
