I’m trying to run a fairly straightforward script: I just want to experiment with running my own chat offline on my setup using the Mistral-7B-Instruct-v0.2 model. My setup is relatively old (I helped some researchers with it back in the day): four GeForce GTX 1080 cards with 8 GB of VRAM each.
If my script looks overly complicated, it’s because I’ve been tweaking it a lot trying to get it to run. Here’s the script:
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

def generate_text(input_text, num_texts=2, max_length=100, num_beams=5, early_stopping=True):
    # Set the GPUs to use
    device_ids = [0, 1, 2, 3]  # Modify this list according to your GPU configuration
    primary_device = f'cuda:{device_ids[0]}'  # Primary device
    torch.cuda.set_device(primary_device)

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2")
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model = AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2").to(primary_device)

    # Move model to GPUs
    model = torch.nn.DataParallel(model, device_ids=device_ids)

    # Tokenize the input text and move to the primary device
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, padding=True, truncation=True)
    inputs = {k: v.to(primary_device) for k, v in inputs.items()}

    # Generate multiple texts using different random seeds
    generated_texts = []
    for i in range(num_texts):
        # Set the random seed for reproducibility
        torch.manual_seed(i)

        # Generate the text using the model
        with torch.no_grad():
            outputs = model.module.generate(**inputs, max_length=max_length, num_beams=num_beams, early_stopping=early_stopping)

        # Decode and add the generated text to the list
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        generated_texts.append(generated_text)

    return generated_texts

if __name__ == "__main__":
    # Set the input text and style
    input_text = "Tell me a story about a dragon and a princess."

    # Generate texts
    generated_texts = generate_text(input_text)

    # Write the generated texts to a JSON file
    with open("generated_texts.json", "w") as f:
        json.dump(generated_texts, f)
This is my output:
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.16it/s]
Traceback (most recent call last):
File "myscript.py", line 44, in <module>
generated_texts = generate_text(input_text)
File "myscript.py", line 14, in generate_text
model = AutoModelForCausalLM.from_pretrained("MistralAI/Mistral-7B-Instruct-v0.2").to(primary_device)
File "/home/user/Transformers/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2556, in to
return super().to(*args, **kwargs)
File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/home/user/Transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 7.92 GiB of which 86.81 MiB is free. Including non-PyTorch memory, this process has 7.12 GiB memory in use. Of the allocated memory 7.02 GiB is allocated by PyTorch, and 1.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Basically, as I show in my screenshot, I only manage to get it to use one GPU. I can pick which GPU, but I can’t make it use the other three. Correct me if I’m wrong, but my understanding now is that I have to manually split the model across the 4 GPUs? In other words, does the model have to fit entirely in one GPU even if I want to use all 4?
Mistral says on their site that the model requires 16 GB, and each of my GPUs has 8 GB. I’ve tried reading up on model parallelism, pipeline parallelism, and sharded data parallelism, but I don’t find much about this model in particular, and mostly these are concepts I have no experience with.
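From what I’ve gathered, something like the snippet below (loading in float16 and letting Hugging Face Accelerate spread the layers across the cards with device_map="auto") is what people mean by splitting the model, but I haven’t been able to confirm it’s the right approach for my cards, so treat it as my best guess rather than something I know works:

# My best guess at splitting the model across the 4 GPUs, based on what I've
# read about Hugging Face Accelerate -- not something I've verified on my setup.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "MistralAI/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision: roughly 14 GB of weights instead of ~28 GB
    device_map="auto",          # needs the `accelerate` package; spreads layers over the visible GPUs
)

inputs = tokenizer("Tell me a story about a dragon and a princess.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Is something along these lines the right direction, or is there more to it on older cards like these?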
Do I need to turn SLI on? Back in the day you didn’t need it for training, but this is for inference.
This also leaves me wondering about bigger models like Mixtral-8x7B-v0.1 that require around 100 GB. A100s only have 80 GB, so the entire model doesn’t fit in a single GPU there either.
I’m aware I could run it on some cloud infrastructure, but that kind of defeats the purpose of what I’m trying to do at the moment, and it means spending money I could probably save since I already have this setup, so why not use it?
I hope you can guide me on this. Thanks a lot.
I already tried specifying the visible GPUs inside the script. I also made all 4 GPUs visible in .bashrc with export CUDA_VISIBLE_DEVICES=0,1,2,3, but still only one GPU is used.
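For reference, this is roughly what I mean by specifying the visible GPUs inside the script (the exact values varied while I was experimenting):

# Roughly what I had at the top of the script; as I understand it,
# CUDA_VISIBLE_DEVICES has to be set before torch initializes CUDA to take effect.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch
print(torch.cuda.device_count())  # sanity check: should print 4 if all cards are visible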