**Describe the bug**
I ran `deepspeed.runtime.zero.stage3.estimate_zero3_model_states_mem_needs_all_live()`. It reports that in stage 3, with offload_optimizer and offload_param enabled, only 0.49GB of GPU memory is needed for params, optim states and gradients:
```
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 6805M total params, 131M largest layer params.
per CPU | per GPU | Options
171.13GB | 0.49GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
171.13GB | 0.49GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
152.11GB | 3.66GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
152.11GB | 3.66GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
2.93GB | 29.01GB | offload_param=none, offload_optimizer=none, zero_init=1
152.11GB | 29.01GB | offload_param=none, offload_optimizer=none, zero_init=0
```
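For context, this is roughly how I produced the estimate above (a minimal sketch; `model` is my LoRA-wrapped LLaMA 2 module built as described under To Reproduce):
```
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# model: the full ~6.8B-parameter module (frozen base weights + LoRA adapters)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```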
But in actual training the GPU memory usage is 14,678 MiB after `deepspeed.initialize()`, 19,890 MiB after the first forward `engine(input_token)`, 20,148 MiB after the first backward `engine.backward(loss)` and `engine.step()`, 1,838 MiB before the second forward, and 21,272 MiB after the second forward. The run then crashes during the second backward with "trying to backward through the graph a second time", which I report in another issue: https://github.com/microsoft/DeepSpeed/issues/4528
I obtained the memory usage through `nvidia-smi`. The above results were obtained on 1x RTX 3090 Ti (24G).
My questions are:
1. Why does DeepSpeed use 14,678 MiB of memory right after `deepspeed.initialize()`? This causes OOM in the first forward pass on 4x P100 (16G). I primarily train my model on 4x P100; the RTX 3090 Ti is only for testing.
2. How does DeepSpeed utilize GPU memory here? I have enabled CPU offload for both the parameters and the optimizer. The 1,838 MiB measured between the first backward and the second forward is, I speculate, what remains once the parameters and optimizer states have been offloaded back to the CPU and are no longer counted on the GPU. Does that mean the additional ~20,000 MiB after the first forward is all activations? If so, I have already enabled activation checkpointing, yet the memory usage is still this high.
3. What is the difference between `zero_init=1` and `zero_init=0`? Why is there such a big difference in CPU memory usage between `zero_init=1` and `zero_init=0` when `offload_optimizer` and `offload_param` are not used? How do I set `zero_init` to 1 or 0? I did not find any description of `zero_init` in the documentation (my current guess is sketched right after this list).
4. Is there any chance I could continue my training on 4 or 8 P100s with 16,384 MiB of memory each? I am fine-tuning LLaMA 2 (about 6.7 billion parameters) with LoRA; the number of trainable parameters is 67,329,792 (~1% of the model).
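Regarding question 3, my unconfirmed guess is that `zero_init=1` corresponds to constructing the model inside the `deepspeed.zero.Init()` context so that parameters are partitioned across ranks at construction time, something like:
```
# Unconfirmed guess at what zero_init=1 means: build the model inside
# deepspeed.zero.Init() so that parameters are partitioned across ranks
# as they are created, instead of first materializing the full model on
# every rank (which would correspond to zero_init=0).
import deepspeed

with deepspeed.zero.Init(config_dict_or_path=deepspeed_config):
    model = Llama_2_With_Adaptive_Lora(model_arg, lora_rank_dict)
```
If that is what the estimator's `zero_init` flag refers to, I suppose it would explain why the per-CPU estimate is so much lower with `zero_init=1` and no offload, since the full model is never materialized on every rank. I would appreciate confirmation.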
**To Reproduce**
The code of the LLaMA 2 model can be found here: https://github.com/facebookresearch/llama/blob/main/llama/model.py. I implemented LoRA myself, and each LoRA linear has its own rank. Every linear in my model is replaced with
```
class Adaptive_Linear(torch.nn.Module):
    def __init__(
        self,
        model_arg: dict,        # batch size, max seqlen, vocab size, etc.
        name: str,              # example: layer_list.0.attention.key
        lora_rank_dict: dict,   # lora rank of each linear
        in_features: int,
        out_features: int
    ):
        super().__init__()
        self.name = name
        self.model_arg = model_arg
        self.linear = torch.nn.Linear(in_features, out_features, bias = False)
        self.lora_rank = lora_rank_dict[self.name]
        if self.lora_rank > 0:
            self.lora_dropout = torch.nn.Dropout(model_arg.lora_dropout)
            self.lora_a_linear = torch.nn.Linear(in_features, self.lora_rank, bias = False)
            self.lora_b_linear = torch.nn.Linear(self.lora_rank, out_features, bias = False)
            # note: due to operator precedence this evaluates as rand_like(...) - 2, i.e. values in [-2, -1)
            self.lora_a_linear.weight.data = torch.rand_like(self.lora_a_linear.weight.data) - 1 * 2
            self.lora_b_linear.weight.data = torch.zeros_like(self.lora_b_linear.weight.data)

    def trainable_parameters(self):
        # only train the lora matrices
        # the trainable parameters of my model are the trainable_parameters
        # of all Adaptive_Linear modules chained together with itertools.chain.
        return iter((self.lora_a_linear.weight, self.lora_b_linear.weight))

    def forward(self, x):
        hidden = self.linear(x)
        if self.lora_rank > 0:
            input = self.lora_dropout(x)
            lora_a = self.lora_a_linear(input)
            lora_b = self.lora_b_linear(lora_a)
            output = hidden + lora_b
        else:
            output = hidden
        return output
```
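For completeness, a minimal standalone usage example of this module (the `model_arg` and `lora_rank_dict` values here are just placeholders for illustration, not my real configuration):
```
import types
import torch

model_arg = types.SimpleNamespace(lora_dropout = 0.05)    # placeholder; only lora_dropout is used here
lora_rank_dict = {'layer_list.0.attention.key': 8}        # placeholder rank table

layer = Adaptive_Linear(model_arg, 'layer_list.0.attention.key', lora_rank_dict, 4096, 4096)
x = torch.randn(1, 16, 4096)
y = layer(x)                                              # shape (1, 16, 4096)
print([p.shape for p in layer.trainable_parameters()])    # only the two LoRA matrices
```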
My dataset is
```
class Dataset(torch.utils.data.Dataset):
    def __init__(self, model_arg, dataset_path):
        self.model_arg = model_arg
        self.dataset_path = dataset_path
        # List[Tuple] of (input token list, output token list, label); tokens and labels are ints
        self.dataset = torch.load(self.dataset_path)
        self.dataset = [data_item for data_item in self.dataset
                        if MIN_SEQ_LEN <= len(data_item[0]) <= MAX_SEQ_LEN]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        input_token, output_token, label = self.dataset[index]
        return torch.tensor(input_token), torch.tensor(output_token), torch.tensor(label)
```
Collate function to pad the tokens of a batch to the same length:
```
def collate_fn(batch):
    input_token_list, output_token_list, label_list = zip(*batch)
    padded_input_token = torch.nn.utils.rnn.pad_sequence([seq for seq in input_token_list], batch_first = True)
    padded_output_token = torch.nn.utils.rnn.pad_sequence([seq for seq in output_token_list], batch_first = True)
    return padded_input_token, padded_output_token, label_list
```
Initialize the model and DeepSpeed
```
model = Llama_2_With_Adaptive_Lora(model_arg, lora_rank_dict)
model.load_state_dict(state_dict)
for name, parameter in model.named_parameters():
    if 'lora' not in name:
        parameter.requires_grad = False
engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    model = model,
    model_parameters = model.trainable_parameters(),
    training_data = train_dataset,
    args = args,    # local_rank only
    collate_fn = collate_fn,
    config = deepspeed_config
)
```
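For reference, the trainable-parameter count quoted above (67,329,792 out of roughly 6.8B total) can be reproduced with a quick check placed just before the `deepspeed.initialize()` call, while the parameters are still fully materialized:
```
# sanity check of parameter counts before DeepSpeed partitions anything
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f'trainable: {trainable_params:,} / total: {total_params:,}')
# in my case this prints trainable: 67,329,792 and a total of about 6,805M
```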
DeepSpeed config
```
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "steps_per_print": 10,
    "wall_clock_breakdown": false,
    "memory_breakdown": false,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 2e-5,
            "betas": [0.9, 0.95],
            "eps": 1e-8,
            "weight_decay": 1e-4
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-5,
            "warmup_num_steps": 1000,
            "total_num_steps": 10000
        }
    },
    "bf16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 1e7,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e7,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "activation_checkpointing": {
        "partition_activations": true,
        "number_checkpoints": 100,
        "cpu_checkpointing": true,
        "contiguous_memory_optimization": true,
        "synchronize_checkpoint_boundary": false,
        "profile": true
    },
    "data_types": {
        "grad_accum_dtype": "bf16"
    }
}
```
Main code of training
```
engine.train()
for input_token, target_token, _ in train_dataloader:
    input_token = input_token.to(engine.device)
    target_token = target_token.to(engine.device)
    pred_logits = engine(input_token)
    target_pos = (target_token != PAD_ID)  # only calculate loss on non-pad tokens
    pred_logits = pred_logits[target_pos].view(-1, model_arg.vocab_size)
    target_token = target_token[target_pos].view(-1)
    loss = torch.nn.functional.cross_entropy(pred_logits, target_token)
    engine.backward(loss)
    engine.step()
```
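The per-step memory numbers in the bug description were read from `nvidia-smi`. For anyone reproducing this, they can also be cross-checked from inside the process with PyTorch's allocator statistics, e.g. something like the following (not part of my actual script, just a sketch):
```
import torch

def log_gpu_mem(tag):
    # allocator view only; nvidia-smi additionally counts the CUDA context,
    # NCCL buffers and anything reserved outside of PyTorch
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f'[{tag}] allocated {allocated:.2f} GB | reserved {reserved:.2f} GB | peak {peak:.2f} GB')

# call e.g. after engine(input_token), engine.backward(loss) and engine.step()
log_gpu_mem('after forward')
```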
**Expected behavior**
Fine-tune my model on P100 GPUs without OOM.
**ds_report output**
ds_report of machine 1 with 1x RTX 3090 Ti (24G). The GPU usage numbers above were obtained on this machine.
```
[2023-10-17 22:28:50,665] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch']
torch version .................... 2.1.0+cu121
deepspeed install path ........... ['/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 125.77 GB
```
ds_report of machine 2 with 8x P100 (16G). I fine-tune my model on this machine.
```
[2023-10-17 22:28:40,428] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [YES] ...... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch']
torch version .................... 2.1.0+cu118
deepspeed install path ........... ['/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.11.2+unknown, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8
shared memory (/dev/shm) size .... 125.78 GB
```
**Screenshots**
No Screenshots
**System info (please complete the following information):**
Machine 1:
OS: Ubuntu 20.04.4
GPU: 1x RTX 3090ti
Interconnects: No
Python version: 3.11.4
Machine 2:
OS: Ubuntu 16.04.6
GPU: 8x Tesla P100
Interconnects: No
Python version: 3.11.5
**Launcher context**
Machine 1:
`PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 deepspeed --include localhost:0 train.py`
Machine 2:
`PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 deepspeed --include localhost:0,1,2,3 train.py`
**Docker context**
No docker
**Additional context**
In the To Reproduce section I simplified some of the code, so the error traceback differs slightly from it.
Complete DeepSpeed output on machine 1 (1x RTX 3090 Ti):
The error is "Trying to backward through the graph a second time", which I report in another issue: https://github.com/microsoft/DeepSpeed/issues/4528
This issue would become too long to submit if I attached that output log here, so please check the output in issue 4528. I apologize for any inconvenience.
Complete DeepSpeed output on machine 2 (8x P100):
The error is OOM:
```
[2023-10-17 23:10:48,978] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:50,117] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-10-17 23:10:50,117] [INFO] [runner.py:570:main] cmd = /home/user/miniconda3/envs/llama2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py
[2023-10-17 23:10:52,522] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:53,468] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4]}
[2023-10-17 23:10:53,468] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=5, node_rank=0
[2023-10-17 23:10:53,468] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4]})
[2023-10-17 23:10:53,468] [INFO] [launch.py:163:main] dist_world_size=5
[2023-10-17 23:10:53,468] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4
[2023-10-17 23:10:55,910] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:55,948] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:55,955] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:55,955] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:55,994] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:57,398] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,767] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,816] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,841] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,842] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,843] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-10-17 23:13:00,132] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.11.2+unknown, git-hash=unknown, git-branch=unknown
[2023-10-17 23:13:12,414] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
[2023-10-17 23:13:14,520] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-10-17 23:13:14,520] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
[2023-10-17 23:13:14,561] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-10-17 23:13:14,561] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-10-17 23:13:14,561] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2023-10-17 23:13:14,561] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2023-10-17 23:13:15,405] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
[2023-10-17 23:13:15,406] [INFO] [utils.py:803:see_memory_usage] MA 12.99 GB Max_MA 12.99 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:15,406] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 121.56 GB, percent = 48.3%
[2023-10-17 23:13:15,412] [INFO] [stage3.py:126:__init__] Reduce bucket size 10000000
[2023-10-17 23:13:15,412] [INFO] [stage3.py:127:__init__] Prefetch bucket size 10000000
[2023-10-17 23:13:15,841] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-10-17 23:13:15,841] [INFO] [utils.py:803:see_memory_usage] MA 12.99 GB Max_MA 12.99 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:15,841] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 121.56 GB, percent = 48.3%
Parameter Offload: Total persistent parameters: 10247936 in 260 params
[2023-10-17 23:13:28,903] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-10-17 23:13:28,904] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 12.99 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:28,904] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.14 GB, percent = 56.9%
[2023-10-17 23:13:29,558] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2023-10-17 23:13:29,558] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:29,558] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.04 GB, percent = 56.9%
[2023-10-17 23:13:35,210] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 1
[2023-10-17 23:13:35,211] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:35,212] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.57 GB, percent = 57.1%
[2023-10-17 23:13:35,692] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2023-10-17 23:13:35,693] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:35,695] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.57 GB, percent = 57.1%
[2023-10-17 23:13:36,267] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2023-10-17 23:13:36,268] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:36,268] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.68 GB, percent = 57.1%
[2023-10-17 23:13:37,129] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-10-17 23:13:37,129] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:37,129] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 144.35 GB, percent = 57.4%
[2023-10-17 23:13:37,776] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2023-10-17 23:13:37,777] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:37,777] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 144.51 GB, percent = 57.4%
[2023-10-17 23:13:37,778] [INFO] [stage3.py:459:_setup_for_real_optimizer] optimizer state initialized
[2023-10-17 23:13:38,669] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2023-10-17 23:13:38,670] [INFO] [utils.py:803:see_memory_usage] MA 0.33 GB Max_MA 0.34 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:38,670] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 114.13 GB, percent = 45.4%
[2023-10-17 23:13:38,671] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2023-10-17 23:13:38,671] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
[2023-10-17 23:13:38,671] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7fa76bb5fe10>
[2023-10-17 23:13:38,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[2e-05], mom=[[0.9, 0.95]]
[2023-10-17 23:13:38,673] [INFO] [config.py:968:print] DeepSpeedEngine configuration:
[2023-10-17 23:13:38,673] [INFO] [config.py:972:print] activation_checkpointing_config {
"partition_activations": true,
"contiguous_memory_optimization": true,
"cpu_checkpointing": true,
"number_checkpoints": 100,
"synchronize_checkpoint_boundary": false,
"profile": true
}
[2023-10-17 23:13:38,673] [INFO] [config.py:972:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-10-17 23:13:38,673] [INFO] [config.py:972:print] amp_enabled .................. False
[2023-10-17 23:13:38,673] [INFO] [config.py:972:print] amp_params ................... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] bfloat16_enabled ............. True
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] checkpoint_parallel_write_pipeline False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] checkpoint_tag_validation_enabled True
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] checkpoint_tag_validation_fail False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7faac78d7510>
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] communication_data_type ...... None
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] curriculum_enabled_legacy .... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] curriculum_params_legacy ..... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] data_efficiency_enabled ...... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] dataloader_drop_last ......... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] disable_allgather ............ False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] dump_state ................... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] dynamic_loss_scale_args ...... None
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_enabled ........... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_gas_boundary_resolution 1
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_layer_num ......... 0
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_max_iter .......... 100
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_stability ......... 1e-06
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_tol ............... 0.01
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_verbose ........... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] elasticity_enabled ........... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] fp16_auto_cast ............... None
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] fp16_enabled ................. False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] fp16_master_weights_and_gradients False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] global_rank .................. 0
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] grad_accum_dtype ............. bf16
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] gradient_accumulation_steps .. 8
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] gradient_clipping ............ 1.0
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] gradient_predivide_factor .... 1.0
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] initial_dynamic_scale ........ 1
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] load_universal_checkpoint .... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] loss_scale ................... 1.0
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] memory_breakdown ............. False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] mics_hierarchial_params_gather False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] mics_shard_size .............. -1
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] optimizer_legacy_fusion ...... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] optimizer_name ............... adamw
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.95], 'eps': 1e-08, 'weight_decay': 0.0001}
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] pld_enabled .................. False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] pld_params ................... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] prescale_gradients ........... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] scheduler_name ............... WarmupDecayLR
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 2e-05, 'warmup_num_steps': 1000, 'total_num_steps': 10000}
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] sparse_attention ............. None
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] sparse_gradients_enabled ..... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] steps_per_print .............. 10
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] train_batch_size ............. 40
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] train_micro_batch_size_per_gpu 1
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] use_node_local_storage ....... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] wall_clock_breakdown ......... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] weight_quantization_config ... None
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] world_size ................... 5
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_allow_untested_optimizer False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=10000000 allgather_partitions=True allgather_bucket_size=10000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=10000000 param_persistence_threshold=100000 model_persistence_threshold=sys.maxsize max_live_parameters=100000000 max_reuse_distance=100000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_enabled ................. True
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_force_ds_cpu_optimizer .. True
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_optimization_stage ...... 3
[2023-10-17 23:13:38,676] [INFO] [config.py:958:print_user_config] json = {
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 8,
"gradient_clipping": 1.0,
"steps_per_print": 10,
"wall_clock_breakdown": false,
"memory_breakdown": false,
"optimizer": {
"type": "AdamW",
"params": {
"lr": 2e-05,
"betas": [0.9, 0.95],
"eps": 1e-08,
"weight_decay": 0.0001
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 2e-05,
"warmup_num_steps": 1000,
"total_num_steps": 1.000000e+04
}
},
"bf16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 1.000000e+07,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+07,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09,
"stage3_prefetch_bucket_size": 1.000000e+07,
"stage3_param_persistence_threshold": 1.000000e+05,
"stage3_max_live_parameters": 1.000000e+08,
"stage3_max_reuse_distance": 1.000000e+08,
"stage3_gather_16bit_weights_on_model_save": true
},
"activation_checkpointing": {
"partition_activations": true,
"number_checkpoints": 100,
"cpu_checkpointing": true,
"contiguous_memory_optimization": true,
"synchronize_checkpoint_boundary": false,
"profile": true
},
"data_types": {
"grad_accum_dtype": "bf16"
}
}
Traceback (most recent call last):
File "/data/user/llama2/src/7_train/train.py", line 449, in <module>
trainer.train_train()
File "/data/user/llama2/src/7_train/train.py", line 408, in train_train
self.train_epoch(epoch)
File "/data/user/llama2/src/7_train/train.py", line 430, in train_epoch
pred_logits = self.model_engine(input_token)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1807, in forward
loss = self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/user/llama2/src/7_train/../0_all_types_of_llama2/Llama_2_Adaptive_Lora.py", line 275, in forward
hidden = layer(hidden, start_pos, freqs_cis, mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/user/llama2/src/7_train/../0_all_types_of_llama2/Llama_2_Adaptive_Lora.py", line 230, in forward
hidden = hidden + self.feed_forward(self.ffn_norm(hidden))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/user/llama2/src/7_train/../0_all_types_of_llama2/Llama_2_Adaptive_Lora.py", line 198, in forward
hidden = self.w2(hidden)
^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/user/llama2/src/7_train/../0_all_types_of_llama2/Llama_2_Adaptive_Lora.py", line 30, in forward
hidden = self.linear(x)
^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1557, in _call_impl
args_result = hook(self, args)
^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
self.__all_gather_params(params_to_fetch, forward)
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1109, in all_gather_coalesced
param_buffer = torch.empty(
^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 3 has a total capacty of 15.89 GiB of which 61.88 MiB is free. Including non-PyTorch memory, this process has 15.82 GiB memory in use. Of the allocated memory 14.22 GiB is allocated by PyTorch, and 387.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-10-17 23:13:48,521] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136199
[2023-10-17 23:13:51,768] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136200
[2023-10-17 23:13:55,047] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136201
[2023-10-17 23:13:58,292] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136202
[2023-10-17 23:13:58,292] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136203
[2023-10-17 23:14:01,610] [ERROR] [launch.py:321:sigkill_handler] ['/home/user/miniconda3/envs/llama2/bin/python', '-u', 'train.py', '--local_rank=4'] exits with return code = 1
```
I would appreciate any help.