Am I doing multi-GPU right?

Hey all,

I am using a local HPC cluster to experiment with training LLMs. I have been able to train GPT-2 and other smaller models without problems, but now I am trying to train EleutherAI/gpt-neo-2.7B and I seem to need a bit more VRAM.
Well okay, I will just use a system with multiple GPUs! I have limited access to a node with a few NVIDIA A100-SXM4-40GB cards, so I made the following Python script:

import subprocess
from transformers import AutoTokenizer, GPTNeoForCausalLM, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling, TextDataset
from accelerate import Accelerator
import torch
import os

# Set CUDA_LAUNCH_BLOCKING=1 to get more detailed error messages
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def print_gpu_memory():
    result = subprocess.run(['nvidia-smi'], stdout=subprocess.PIPE)
    print(result.stdout.decode('utf-8'))

# Initialize the accelerator
accelerator = Accelerator()

# Print initial GPU memory usage
print("Initial GPU memory usage:")
print_gpu_memory()

# Load the model and tokenizer
model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTNeoForCausalLM.from_pretrained(model_name, gradient_checkpointing=True)

# Add a padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})
    model.resize_token_embeddings(len(tokenizer))


# Print GPU memory usage after loading model and tokenizer
print("After loading model and tokenizer:")
print_gpu_memory()

# Load the text from the file
file_path = "cleaned.txt"

# Load dataset
print("Just before loading dataset")
print_gpu_memory()

train_dataset = TextDataset(tokenizer=tokenizer, file_path=file_path, block_size=512)
print(f"Number of samples in the dataset: {len(train_dataset)}")

# Print GPU memory usage after loading dataset
print("After loading dataset:")
print_gpu_memory()

# Define data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # Accumulate gradients to simulate a larger batch size
    save_steps=10_000,
    save_total_limit=2,
    learning_rate=1e-4,
    dataloader_num_workers=4,
    fp16=True
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

# Prepare everything with accelerator
model, train_dataset, training_args = accelerator.prepare(
    model, train_dataset, training_args
)

# Print GPU memory usage after preparing with accelerator
print("After preparing with accelerator:")
print_gpu_memory()

torch.cuda.empty_cache()

# Train the model
print("Just before training")
print_gpu_memory()

training_successful = False

try:
    # sync GPUs
    accelerator.wait_for_everyone()
    trainer.train()
    training_successful = True
except RuntimeError as e:
    training_successful = False
    print(f"Training failed with error: {e}")

# Print GPU memory usage after training
print("After training:")
print_gpu_memory()

# Save the fine-tuned model and tokenizer
if training_successful:
    model.save_pretrained('./trained_models/neo_2_epochs')

# Print GPU memory usage after saving the model
print("After saving the model:")
print_gpu_memory()

But what I see is that everything is loaded on just one GPU:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:31:00.0 Off |                  Off |
| N/A   32C    P0             70W /  400W |   40445MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:32:00.0 Off |                  Off |
| N/A   29C    P0             65W /  400W |       5MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

And it dies with a

Training failed with error: CUDA out of memory. Tried to allocate 100.00 MiB. GPU

Can anyone please tell me what I am doing wrong?

It looks like the GPU itself is not being recognized by torch.
A common cause is that the CPU-only build of torch was installed instead of the CUDA build. Please try reinstalling the appropriate version by selecting your environment from the link below.
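
As a quick sanity check, something like this (just a rough sketch) should tell you which build of torch is installed and whether it can see your GPUs:

import torch

# Which torch build is installed, and can it see the GPUs?
print("torch:", torch.__version__)               # ends with "+cu…" for CUDA builds, "+cpu" for CPU-only pip wheels
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)         # None for CPU-only builds
print("GPU count:", torch.cuda.device_count())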

Heya! Thanks for the suggestion. I am not sure that is the case, though.

I added the following to the script:

import torch
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

And it returned:

NVIDIA A100-SXM4-40GB
NVIDIA A100-SXM4-40GB

So it seems to recognize them?

I overlooked the fact that you were able to use one of the GPUs…:sweat_smile:
For normal inference, you don’t need to do much special configuration for multi-GPU (the accelerate library will handle it for you), but it seems that you need to manually configure various settings when training the model.
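
For example, for plain inference you can usually just let accelerate shard the model over the visible GPUs (a rough sketch, not tailored to your setup):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" (requires accelerate) splits the layers across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))

For training with the Trainer, as far as I know you also have to start one process per GPU yourself, e.g. with accelerate launch your_script.py, torchrun --nproc_per_node=2 your_script.py, or the deepspeed launcher; running the script with plain python starts a single process and therefore only uses one GPU unless the model is explicitly sharded.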

Happens to the best of us.
Thanks, I will read the link you gave.

Apparently I am still being stupid: I now get "None gradient" errors. If I check all the parameters of the model, it seems they all have None gradients?

import subprocess
from transformers import AutoTokenizer, GPTNeoForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling, TextDataset
from accelerate import Accelerator
import torch
import os
import deepspeed

# Set CUDA_LAUNCH_BLOCKING=1 to get more detailed error messages
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Print the available GPUs
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

def print_gpu_memory():
    result = subprocess.run(['nvidia-smi'], stdout=subprocess.PIPE)
    print(result.stdout.decode('utf-8'))

# Initialize the accelerator
accelerator = Accelerator()

# Print initial GPU memory usage
# print("Initial GPU memory usage:")
# print_gpu_memory()

# Load the model and tokenizer
model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTNeoForCausalLM.from_pretrained(model_name, gradient_checkpointing=True)

# Add a padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})
    model.resize_token_embeddings(len(tokenizer))

# Print GPU memory usage after loading model and tokenizer
# print("After loading model and tokenizer:")
# print_gpu_memory()

# Load the book text from the file
file_path = "./processed_data/cleaned.txt"

# Load dataset
# print("Just before loading dataset")
# print_gpu_memory()

train_dataset = TextDataset(tokenizer=tokenizer, file_path=file_path, block_size=512)
# train_dataset = torch.utils.data.Subset(train_dataset, range(1))
print(f"Number of samples in the dataset: {len(train_dataset)}")

# Print GPU memory usage after loading dataset
# print("After loading dataset:")
# print_gpu_memory()

# Define data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # Accumulate gradients to simulate a larger batch size
    save_steps=10_000,
    save_total_limit=2,
    learning_rate=1e-4,
    dataloader_num_workers=4,
    fp16=True,
    deepspeed="./deepspeed_confs/neogpt_2B7.json"
)

class CustomTrainer(Trainer):
    def training_step(self, model, inputs):
        model.train()
        inputs = self._prepare_inputs(inputs)
        outputs = model(**inputs)
        loss = outputs.loss
        accelerator.backward(loss)
        check_none_gradients(model)  # Check for None gradients
        return loss.detach()

def check_none_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"Parameter {name} has None gradient")

# Ensure the model is in training mode
model.train()

# Initialize the Trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

# Prepare everything with accelerator
model, train_dataset, training_args = accelerator.prepare(
    model, train_dataset, training_args
)

# Print GPU memory usage after preparing with accelerator
# print("After preparing with accelerator:")
# print_gpu_memory()

torch.cuda.empty_cache()

# Train the model
# print("Just before training")
# print_gpu_memory()

training_successful = False

try:
    # sync GPUs
    accelerator.wait_for_everyone()
    
    # Access the original model's configuration if wrapped in DDP
    original_model = model.module if hasattr(model, 'module') else model
    
    # Disable use_cache if gradient checkpointing is enabled
    if hasattr(original_model.config, 'gradient_checkpointing') and original_model.config.gradient_checkpointing:
        original_model.config.use_cache = False
    
    # Train the model
    trainer.train()
    
    training_successful = True
except RuntimeError as e:
    training_successful = False
    print(f"Training failed with error: {e}")

# Print GPU memory usage after training
# print("After training:")
print_gpu_memory()

# Save the fine-tuned model and tokenizer
if training_successful:
    model.save_pretrained('./trained_models/neo_2_epochs')

# Print GPU memory usage after saving the model
# print("After saving the model:")
# print_gpu_memory()

# Check for None gradients after training
check_none_gradients(model)

With the following DeepSpeed JSON config:

{
  "train_batch_size": "auto",
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0
    }
  }
}

Am I doing something completely wrong or?

That’s a difficult error… Is it this? The one below is just a torch version error.

Hmmm, I seem to have forgotten to post my output… Smart me.

[2024-11-29 14:33:06,356] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-29 14:33:09,315] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2024-11-29 14:33:09,350] [INFO] [runner.py:555:main] cmd = /home/NotEnoughVRAM /.conda/envs/LLM_Trainer/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None gpt_neo_2B7_finetune.py
[2024-11-29 14:33:10,819] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-29 14:33:11,739] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-11-29 14:33:11,739] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-11-29 14:33:11,739] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-11-29 14:33:11,739] [INFO] [launch.py:163:main] dist_world_size=2
[2024-11-29 14:33:11,739] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-11-29 14:33:14,109] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-29 14:33:14,109] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
NVIDIA A100-SXM4-40GBNVIDIA A100-SXM4-40GB

NVIDIA A100-SXM4-40GBNVIDIA A100-SXM4-40GB

/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Using pad_token, but it is not set yet.
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/data/datasets/language_modeling.py:53: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
  warnings.warn(
Using pad_token, but it is not set yet.
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/data/datasets/language_modeling.py:53: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
  warnings.warn(
Number of samples in the dataset: 3256
[2024-11-29 14:33:45,099] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-11-29 14:33:45,100] [INFO] [comm.py:594:init_distributed] cdb=None
Number of samples in the dataset: 3256
[2024-11-29 14:33:45,167] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-11-29 14:33:45,167] [INFO] [comm.py:594:init_distributed] cdb=None
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!

Using /gpfs/home1/NotEnoughVRAM /.cache/torch_extensions/py38_cu121 as PyTorch extensions root...Using /gpfs/home1/NotEnoughVRAM /.cache/torch_extensions/py38_cu121 as PyTorch extensions root...

Emitting ninja build file /gpfs/home1/NotEnoughVRAM /.cache/torch_extensions/py38_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7574810981750488 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7965133190155029 seconds
Parameter Offload: Total persistent parameters: 824320 in 226 params

  0%|          | 0/406 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905975447/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905975447/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Parameter module.transformer.wpe.weight has None gradient
Parameter module.transformer.h.0.ln_1.weight has None gradient
Parameter module.transformer.h.0.ln_1.bias has None gradient
Parameter module.transformer.h.0.attn.attention.k_proj.weight has None gradient
Parameter module.transformer.h.0.attn.attention.v_proj.weight has None gradient
Parameter module.transformer.h.0.attn.attention.q_proj.weight has None gradient
Parameter module.transformer.h.0.attn.attention.out_proj.weight has None gradient
Parameter module.transformer.h.0.attn.attention.out_proj.bias has None gradient
Parameter module.transformer.h.0.ln_2.weight has None gradient
Parameter module.transformer.h.0.ln_2.bias has None gradientParameter module.transformer.wpe.weight has None gradient

<SNIP due to char limit of discourse>

Parameter module.transformer.h.31.mlp.c_proj.bias has None gradient
Parameter module.transformer.ln_f.weight has None gradient
Parameter module.transformer.ln_f.bias has None gradient

  0%|          | 1/406 [00:57<6:25:55, 57.17s/it][rank1]: Traceback (most recent call last):
[rank1]:   File "gpt_neo_2B7_finetune.py", line 130, in <module>
[rank1]:     trainer.train()
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs)
[rank1]:   File "gpt_neo_2B7_finetune.py", line 81, in training_step
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/accelerate/accelerator.py", line 1316, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/function.py", line 301, in apply
[rank1]:     return user_fn(self, *args)
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 320, in backward
[rank1]:     torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1006, in reduce_partition_and_remove_grads
[rank1]:     self.reduce_ready_partitions_and_remove_grads(param, i)
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1286, in reduce_ready_partitions_and_remove_grads
[rank1]:     self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1041, in reduce_independent_p_g_buckets_and_remove_grads
[rank1]:     self.__reduce_and_partition_ipg_grads()
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1069, in __reduce_and_partition_ipg_grads
[rank1]:     if param.grad.numel() != param.ds_numel:
[rank1]: AttributeError: 'NoneType' object has no attribute 'numel'
[rank0]: Traceback (most recent call last):
[rank0]:   File "gpt_neo_2B7_finetune.py", line 130, in <module>
[rank0]:     trainer.train()
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "gpt_neo_2B7_finetune.py", line 81, in training_step
[rank0]:     accelerator.backward(loss)
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/accelerate/accelerator.py", line 1316, in backward
[rank0]:     loss.backward(**kwargs)
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/function.py", line 301, in apply
[rank0]:     return user_fn(self, *args)
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 320, in backward
[rank0]:     torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1006, in reduce_partition_and_remove_grads
[rank0]:     self.reduce_ready_partitions_and_remove_grads(param, i)
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1286, in reduce_ready_partitions_and_remove_grads
[rank0]:     self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1041, in reduce_independent_p_g_buckets_and_remove_grads
[rank0]:     self.__reduce_and_partition_ipg_grads()
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1069, in __reduce_and_partition_ipg_grads
[rank0]:     if param.grad.numel() != param.ds_numel:
[rank0]: AttributeError: 'NoneType' object has no attribute 'numel'

That last error is due to the NoneType…
But as you can see, it thinks the entire model has None gradients.

So I checked the versions:

Python 3.8.13
torch version: 2.3.1
transformers version: 4.28.1
accelerate version: 0.15.0
deepspeed version: 0.9.5
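
(For reference, I printed these with roughly the following snippet; the exact calls may have differed slightly:)

import sys
import torch, transformers, accelerate, deepspeed

# Dump the Python and library versions used for this run
print("Python", sys.version.split()[0])
print("torch version:", torch.__version__)
print("transformers version:", transformers.__version__)
print("accelerate version:", accelerate.__version__)
print("deepspeed version:", deepspeed.__version__)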

I do not have pytorch-lightning installed.

I will try another model; maybe this one just clashes with that combination of library versions.

It’s close to the former because it’s in a DDP environment, but it’s not exactly right.