Running Mistral-7B-Instruct-v0.2 on multiple GPUs

I’m trying to run a fairly straightforward script: I just want to experiment with running my own chat offline, using the Mistral-7B-Instruct-v0.2 model on my own hardware. My setup is relatively old (I used it to help some researchers back in the day): four GeForce GTX 1080 cards with 8 GB of VRAM each.

If my script looks overly complicated, it’s because I’ve been tweaking it a lot trying to get it to run. Here’s the script:

import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

def generate_text(input_text, num_texts=2, max_length=100, num_beams=5, early_stopping=True):
    # Set the GPUs to use
    device_ids = [0, 1, 2, 3]  # Modify this list according to your GPU configuration
    primary_device = f'cuda:{device_ids[0]}'  # Primary device
    torch.cuda.set_device(primary_device)

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2")
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model = AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2").to(primary_device)

    # Move model to GPUs
    model = torch.nn.DataParallel(model, device_ids=device_ids)

    # Tokenize the input text and move to the primary device
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, padding=True, truncation=True)
    inputs = {k: v.to(primary_device) for k, v in inputs.items()}

    # Generate multiple texts using different random seeds
    generated_texts = []
    for i in range(num_texts):
        # Set the random seed for reproducibility
        torch.manual_seed(i)

        # Generate the text using the model
        with torch.no_grad():
            outputs = model.module.generate(**inputs, max_length=max_length, num_beams=num_beams, early_stopping=early_stopping)

        # Decode and add the generated text to the list
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        generated_texts.append(generated_text)

    return generated_texts

if __name__ == "__main__":
    # Set the input text and style
    input_text = "Tell me a story about a dragon and a princess."

    # Generate texts
    generated_texts = generate_text(input_text)

    # Write the generated texts to a JSON file
    with open("generated_texts.json", "w") as f:
        json.dump(generated_texts, f)

This is my output:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.16it/s]
Traceback (most recent call last):
  File "myscript.py", line 44, in <module>
    generated_texts = generate_text(input_text)
  File "myscript.py", line 14, in generate_text
    model = AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2").to(primary_device)
  File "/home/user/Transformers/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2556, in to
    return super().to(*args, **kwargs)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 7.92 GiB of which 86.81 MiB is free. Including non-PyTorch memory, this process has 7.12 GiB memory in use. Of the allocated memory 7.02 GiB is allocated by PyTorch, and 1.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Basically, as the output above shows, I only manage to run it on one GPU. I can pick which GPU, but I can’t make it use the other three. Correct me if I’m wrong, but my understanding now is that I have to manually split the model across the 4 GPUs? In other words, does the model have to fit entirely in one GPU even if I want to use all 4?

Mistral says on their site that the model requires 16 GB. Each of my GPUs has 8 GB. I’ve tried searching about model parallelism, pipeline parallelism, and sharded data parallelism, but I don’t find much about this model in particular, and mostly these are concepts I don’t have experience with.

Do I need to turn SLI on? Back in the day you didn’t need it for training, but this is for inference.

This leaves me with another question: what about bigger models like Mixtral-8x7B-v0.1 that need around 100 GB? A100s only have 80 GB, so the entire model doesn’t fit in a single GPU even there.

I’m aware I could run it on some cloud infrastructure, but that kind of defeats the purpose of what I’m trying to do at the moment, and it means spending resources I don’t need to spend since I already have this setup. Why not use it?

I hope you can guide me on this. Thanks a lot.

I already tried specifying the visible GPUs in the script. I also made all 4 GPUs visible in my .bashrc (export CUDA_VISIBLE_DEVICES=0,1,2,3), but still only one GPU is used.

Hi @blutzauber!

The from_pretrained() function has a device_map parameter: with device_map="auto" it automatically distributes the model across your GPUs (you may need to install the accelerate package for this).
So you would only have to change the following:

AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2", device_map="auto")

I hope this helps you.


Worked like a dream! Thanks. I didn’t have to do anything else, just install an extra package.

Sadly, I can’t go beyond 5 beams and about 110 tokens, so the texts are fairly short. I may still want to upgrade the hardware. Also, every generated text is exactly the same, even though the seeds are different!
If you have any idea as to why I’ll be so grateful, so many thanks!

Great, I’m glad to hear that it worked out!

What kind of problem do you get when generating? Is it also an OutOfMemoryError?
If so, you could load the model quantized with:

AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2", device_map="auto", load_in_4bit=True)

You will probably need to install the bitsandbytes package for this.
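If you want a bit more control, a rough sketch with an explicit quantization config (assuming a reasonably recent transformers plus bitsandbytes) would be:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights take roughly 4 GB for a 7B model, which leaves
# much more room on 8 GB cards for longer generations.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "MistralAI/Mistral-7B-Instruct-v0.2",
    device_map="auto",
    quantization_config=bnb_config,
)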

Here is a great blog to learn more about quantization.

About the identical outputs: beam search with do_sample=False is deterministic, so setting a different seed has no effect. To obtain different outputs from the generation, you can enable sampling and set the following parameters:

model.generate(**inputs, do_sample=True, num_beams=4, top_k=40, temperature=1.0)

Depending on these parameters, generate() selects one of the following decoding strategies:
  • greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
  • multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
  • beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.
  • beam-search multinomial sampling by calling beam_sample() if num_beams>1 and do_sample=True.
  • diverse beam-search decoding by calling group_beam_search(), if num_beams>1 and num_beam_groups>1.
  • constrained beam-search decoding by calling constrained_beam_search(), if constraints!=None or force_words_ids!=None.

Here you can find more about the generation parameters.
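For example (just a sketch, reusing the model, tokenizer and inputs from your script), your generation loop could become something like this; note that without DataParallel you call model.generate() directly, not model.module.generate():

generated_texts = []
for i in range(num_texts):
    torch.manual_seed(i)  # with do_sample=True the seed actually changes the output
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,  # limits only the newly generated tokens
            do_sample=True,      # sample instead of deterministic beam search
            top_k=40,
            temperature=0.8,
        )
    generated_texts.append(tokenizer.decode(outputs[0], skip_special_tokens=True))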

I hope this helps you. 🙂
