**Describe the bug**
I ran `deepspeed.runtime.zero.stage3.estimate_zero3_model_states_mem_needs_all_live()`. It reports that in stage 3, with offload_optimizer and offload_param enabled, only 0.49GB of GPU memory is needed for params, optim states and gradients:
```
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 6805M total params, 131M largest layer params.
per CPU | per GPU | Options
171.13GB | 0.49GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
171.13GB | 0.49GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
152.11GB | 3.66GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
152.11GB | 3.66GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
2.93GB | 29.01GB | offload_param=none, offload_optimizer=none, zero_init=1
152.11GB | 29.01GB | offload_param=none, offload_optimizer=none, zero_init=0
```
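For context, this is roughly how I produced the estimate above (a minimal sketch; `model` is my LoRA-wrapped LLaMA 2 module built as described under To Reproduce):
```
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# model: the full ~6.8B-parameter module (frozen base weights + LoRA adapters)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```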
But in actual training the GPU memory usage is 14,678 MiB after `deepspeed.initialize()`, 19,890 MiB after the first forward `engine(input_token)`, 20,148 MiB after the first backward `engine.backward(loss)` and `engine.step()`, 1,838 MiB before the second forward, and 21,272 MiB after the second forward. The run then crashes during the second backward with "trying to backward through the graph a second time", which I report in another issue: https://github.com/microsoft/DeepSpeed/issues/4528
I obtained the memory usage through `nvidia-smi`. The above results were obtained on 1x RTX 3090 Ti (24G).
My questions are:
1. Why does DeepSpeed use 14,678 MiB of memory right after `deepspeed.initialize()`? This causes OOM in the first forward pass on 4x P100 (16G). I primarily train my model on 4x P100; the RTX 3090 Ti is only for testing.
2. How does DeepSpeed utilize GPU memory here? I have enabled CPU offload for both the parameters and the optimizer. The 1,838 MiB measured between the first backward and the second forward is, I speculate, what remains once the parameters and optimizer states have been offloaded back to the CPU and are no longer counted on the GPU. Does that mean the additional ~20,000 MiB after the first forward is all activations? If so, I have already enabled activation checkpointing, yet the memory usage is still this high.
3. What is the difference between `zero_init=1` and `zero_init=0`? Why is there such a big difference in CPU memory usage between `zero_init=1` and `zero_init=0` when `offload_optimizer` and `offload_param` are not used? How do I set `zero_init` to 1 or 0? I did not find any description of `zero_init` in the documentation (my current guess is sketched right after this list).
4. Is there any chance I could continue my training on 4 or 8 P100s with 16,384 MiB of memory each? I am fine-tuning LLaMA 2 (about 6.7 billion parameters) with LoRA; the number of trainable parameters is 67,329,792 (~1% of the model).
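Regarding question 3, my unconfirmed guess is that `zero_init=1` corresponds to constructing the model inside the `deepspeed.zero.Init()` context so that parameters are partitioned across ranks at construction time, something like:
```
# Unconfirmed guess at what zero_init=1 means: build the model inside
# deepspeed.zero.Init() so that parameters are partitioned across ranks
# as they are created, instead of first materializing the full model on
# every rank (which would correspond to zero_init=0).
import deepspeed

with deepspeed.zero.Init(config_dict_or_path=deepspeed_config):
    model = Llama_2_With_Adaptive_Lora(model_arg, lora_rank_dict)
```
If that is what the estimator's `zero_init` flag refers to, I suppose it would explain why the per-CPU estimate is so much lower with `zero_init=1` and no offload, since the full model is never materialized on every rank. I would appreciate confirmation.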
**To Reproduce**
The code of the LLaMA 2 model can be found here: https://github.com/facebookresearch/llama/blob/main/llama/model.py. I implemented LoRA myself, and each LoRA linear has its own rank. Every linear in my model is replaced with
```
class Adaptive_Linear(torch.nn.Module):
    def __init__(
        self,
        model_arg: dict,        # batch size, max seqlen, vocab size, etc.
        name: str,              # example: layer_list.0.attention.key
        lora_rank_dict: dict,   # lora rank of each linear
        in_features: int,
        out_features: int
    ):
        super().__init__()
        self.name = name
        self.model_arg = model_arg
        self.linear = torch.nn.Linear(in_features, out_features, bias = False)
        self.lora_rank = lora_rank_dict[self.name]
        if self.lora_rank > 0:
            self.lora_dropout = torch.nn.Dropout(model_arg.lora_dropout)
            self.lora_a_linear = torch.nn.Linear(in_features, self.lora_rank, bias = False)
            self.lora_b_linear = torch.nn.Linear(self.lora_rank, out_features, bias = False)
            # note: due to operator precedence this evaluates as rand_like(...) - 2, i.e. values in [-2, -1)
            self.lora_a_linear.weight.data = torch.rand_like(self.lora_a_linear.weight.data) - 1 * 2
            self.lora_b_linear.weight.data = torch.zeros_like(self.lora_b_linear.weight.data)

    def trainable_parameters(self):
        # only train the lora matrices
        # the trainable parameters of my model are the trainable_parameters
        # of all Adaptive_Linear modules chained together with itertools.chain.
        return iter((self.lora_a_linear.weight, self.lora_b_linear.weight))

    def forward(self, x):
        hidden = self.linear(x)
        if self.lora_rank > 0:
            input = self.lora_dropout(x)
            lora_a = self.lora_a_linear(input)
            lora_b = self.lora_b_linear(lora_a)
            output = hidden + lora_b
        else:
            output = hidden
        return output
```
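For completeness, a minimal standalone usage example of this module (the `model_arg` and `lora_rank_dict` values here are just placeholders for illustration, not my real configuration):
```
import types
import torch

model_arg = types.SimpleNamespace(lora_dropout = 0.05)    # placeholder; only lora_dropout is used here
lora_rank_dict = {'layer_list.0.attention.key': 8}        # placeholder rank table

layer = Adaptive_Linear(model_arg, 'layer_list.0.attention.key', lora_rank_dict, 4096, 4096)
x = torch.randn(1, 16, 4096)
y = layer(x)                                              # shape (1, 16, 4096)
print([p.shape for p in layer.trainable_parameters()])    # only the two LoRA matrices
```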
My dataset is
```
class Dataset(torch.utils.data.Dataset):
    def __init__(self, model_arg, dataset_path):
        self.model_arg = model_arg
        self.dataset_path = dataset_path
        # List[Tuple] of (input token list, output token list, label); tokens and labels are ints
        self.dataset = torch.load(self.dataset_path)
        self.dataset = [data_item for data_item in self.dataset
                        if MIN_SEQ_LEN <= len(data_item[0]) <= MAX_SEQ_LEN]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        input_token, output_token, label = self.dataset[index]
        return torch.tensor(input_token), torch.tensor(output_token), torch.tensor(label)
```
Collate function to pad the tokens of a batch to the same length:
```
def collate_fn(batch):
    input_token_list, output_token_list, label_list = zip(*batch)
    padded_input_token = torch.nn.utils.rnn.pad_sequence([seq for seq in input_token_list], batch_first = True)
    padded_output_token = torch.nn.utils.rnn.pad_sequence([seq for seq in output_token_list], batch_first = True)
    return padded_input_token, padded_output_token, label_list
```
Initialize the model and DeepSpeed
```
model = Llama_2_With_Adaptive_Lora(model_arg, lora_rank_dict)
model.load_state_dict(state_dict)
for name, parameter in model.named_parameters():
    if 'lora' not in name:
        parameter.requires_grad = False
engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    model = model,
    model_parameters = model.trainable_parameters(),
    training_data = train_dataset,
    args = args,    # local_rank only
    collate_fn = collate_fn,
    config = deepspeed_config
)
```
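For reference, the trainable-parameter count quoted above (67,329,792 out of roughly 6.8B total) can be reproduced with a quick check placed just before the `deepspeed.initialize()` call, while the parameters are still fully materialized:
```
# sanity check of parameter counts before DeepSpeed partitions anything
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f'trainable: {trainable_params:,} / total: {total_params:,}')
# in my case this prints trainable: 67,329,792 and a total of about 6,805M
```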
DeepSpeed config
```
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "steps_per_print": 10,
    "wall_clock_breakdown": false,
    "memory_breakdown": false,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 2e-5,
            "betas": [0.9, 0.95],
            "eps": 1e-8,
            "weight_decay": 1e-4
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-5,
            "warmup_num_steps": 1000,
            "total_num_steps": 10000
        }
    },
    "bf16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 1e7,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e7,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "activation_checkpointing": {
        "partition_activations": true,
        "number_checkpoints": 100,
        "cpu_checkpointing": true,
        "contiguous_memory_optimization": true,
        "synchronize_checkpoint_boundary": false,
        "profile": true
    },
    "data_types": {
        "grad_accum_dtype": "bf16"
    }
}
```
Main code of training
```
engine.train()
for input_token, target_token, _ in train_dataloader:
    input_token = input_token.to(engine.device)
    target_token = target_token.to(engine.device)
    pred_logits = engine(input_token)
    target_pos = (target_token != PAD_ID)  # only calculate loss on non-pad tokens
    pred_logits = pred_logits[target_pos].view(-1, model_arg.vocab_size)
    target_token = target_token[target_pos].view(-1)
    loss = torch.nn.functional.cross_entropy(pred_logits, target_token)
    engine.backward(loss)
    engine.step()
```
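The per-step memory numbers in the bug description were read from `nvidia-smi`. For anyone reproducing this, they can also be cross-checked from inside the process with PyTorch's allocator statistics, e.g. something like the following (not part of my actual script, just a sketch):
```
import torch

def log_gpu_mem(tag):
    # allocator view only; nvidia-smi additionally counts the CUDA context,
    # NCCL buffers and anything reserved outside of PyTorch
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f'[{tag}] allocated {allocated:.2f} GB | reserved {reserved:.2f} GB | peak {peak:.2f} GB')

# call e.g. after engine(input_token), engine.backward(loss) and engine.step()
log_gpu_mem('after forward')
```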
**Expected behavior**
Fine-tune my model on P100 GPUs without OOM.
**ds_report output**
ds_report of machine 1 with 1x RTX 3090 Ti (24G). The GPU usage numbers above were obtained on this machine.
```
[2023-10-17 22:28:50,665] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch']
torch version .................... 2.1.0+cu121
deepspeed install path ........... ['/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 125.77 GB
```
ds_report of machine 2 with 8x P100 (16G). I fine-tune my model on this machine.
```
[2023-10-17 22:28:40,428] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [YES] ...... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch']
torch version .................... 2.1.0+cu118
deepspeed install path ........... ['/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.11.2+unknown, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8
shared memory (/dev/shm) size .... 125.78 GB
```
**Screenshots**
No Screenshots
**System info (please complete the following information):**
Machine 1:
OS: Ubuntu 20.04.4
GPU: 1x RTX 3090ti
Interconnects: No
Python version: 3.11.4
Machine 2:
OS: Ubuntu 16.04.6
GPU: 8x Tesla P100
Interconnects: No
Python version: 3.11.5
**Launcher context**
Machine 1:
`PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 deepspeed --include localhost:0 train.py`
Machine 2:
`PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 deepspeed --include localhost:0,1,2,3 train.py`
**Docker context**
No docker
**Additional context**
In the To Reproduce section I simplified some of the code, so the error traceback differs slightly from it.
Complete DeepSpeed output on machine 1 (1x RTX 3090 Ti):
The error is "Trying to backward through the graph a second time", which I report in another issue: https://github.com/microsoft/DeepSpeed/issues/4528
This issue would become too long to submit if I attached that output log here, so please check the output in issue 4528. I apologize for any inconvenience.
Complete DeepSpeed output on machine 2 (8x P100):
The error is OOM:
```
[2023-10-17 23:10:48,978] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:50,117] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-10-17 23:10:50,117] [INFO] [runner.py:570:main] cmd = /home/user/miniconda3/envs/llama2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py
[2023-10-17 23:10:52,522] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:53,468] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4]}
[2023-10-17 23:10:53,468] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=5, node_rank=0
[2023-10-17 23:10:53,468] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4]})
[2023-10-17 23:10:53,468] [INFO] [launch.py:163:main] dist_world_size=5
[2023-10-17 23:10:53,468] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4
[2023-10-17 23:10:55,910] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:55,948] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:55,955] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:55,955] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:55,994] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-17 23:10:57,398] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,767] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,816] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,841] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,842] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-17 23:10:57,843] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-10-17 23:13:00,132] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.11.2+unknown, git-hash=unknown, git-branch=unknown
[2023-10-17 23:13:12,414] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
[2023-10-17 23:13:14,520] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-10-17 23:13:14,520] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000100, adam_w=1
[2023-10-17 23:13:14,561] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-10-17 23:13:14,561] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-10-17 23:13:14,561] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2023-10-17 23:13:14,561] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2023-10-17 23:13:15,405] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
[2023-10-17 23:13:15,406] [INFO] [utils.py:803:see_memory_usage] MA 12.99 GB Max_MA 12.99 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:15,406] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 121.56 GB, percent = 48.3%
[2023-10-17 23:13:15,412] [INFO] [stage3.py:126:__init__] Reduce bucket size 10000000
[2023-10-17 23:13:15,412] [INFO] [stage3.py:127:__init__] Prefetch bucket size 10000000
[2023-10-17 23:13:15,841] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-10-17 23:13:15,841] [INFO] [utils.py:803:see_memory_usage] MA 12.99 GB Max_MA 12.99 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:15,841] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 121.56 GB, percent = 48.3%
Parameter Offload: Total persistent parameters: 10247936 in 260 params
[2023-10-17 23:13:28,903] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-10-17 23:13:28,904] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 12.99 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:28,904] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.14 GB, percent = 56.9%
[2023-10-17 23:13:29,558] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2023-10-17 23:13:29,558] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:29,558] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.04 GB, percent = 56.9%
[2023-10-17 23:13:35,210] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 1
[2023-10-17 23:13:35,211] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:35,212] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.57 GB, percent = 57.1%
[2023-10-17 23:13:35,692] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2023-10-17 23:13:35,693] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:35,695] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.57 GB, percent = 57.1%
[2023-10-17 23:13:36,267] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2023-10-17 23:13:36,268] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:36,268] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 143.68 GB, percent = 57.1%
[2023-10-17 23:13:37,129] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-10-17 23:13:37,129] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:37,129] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 144.35 GB, percent = 57.4%
[2023-10-17 23:13:37,776] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2023-10-17 23:13:37,777] [INFO] [utils.py:803:see_memory_usage] MA 0.31 GB Max_MA 0.31 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:37,777] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 144.51 GB, percent = 57.4%
[2023-10-17 23:13:37,778] [INFO] [stage3.py:459:_setup_for_real_optimizer] optimizer state initialized
[2023-10-17 23:13:38,669] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2023-10-17 23:13:38,670] [INFO] [utils.py:803:see_memory_usage] MA 0.33 GB Max_MA 0.34 GB CA 13.01 GB Max_CA 13 GB
[2023-10-17 23:13:38,670] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 114.13 GB, percent = 45.4%
[2023-10-17 23:13:38,671] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2023-10-17 23:13:38,671] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
[2023-10-17 23:13:38,671] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7fa76bb5fe10>
[2023-10-17 23:13:38,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[2e-05], mom=[[0.9, 0.95]]
[2023-10-17 23:13:38,673] [INFO] [config.py:968:print] DeepSpeedEngine configuration:
[2023-10-17 23:13:38,673] [INFO] [config.py:972:print] activation_checkpointing_config {
"partition_activations": true,
"contiguous_memory_optimization": true,
"cpu_checkpointing": true,
"number_checkpoints": 100,
"synchronize_checkpoint_boundary": false,
"profile": true
}
[2023-10-17 23:13:38,673] [INFO] [config.py:972:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-10-17 23:13:38,673] [INFO] [config.py:972:print] amp_enabled .................. False
[2023-10-17 23:13:38,673] [INFO] [config.py:972:print] amp_params ................... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] bfloat16_enabled ............. True
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] checkpoint_parallel_write_pipeline False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] checkpoint_tag_validation_enabled True
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] checkpoint_tag_validation_fail False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7faac78d7510>
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] communication_data_type ...... None
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] curriculum_enabled_legacy .... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] curriculum_params_legacy ..... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] data_efficiency_enabled ...... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] dataloader_drop_last ......... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] disable_allgather ............ False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] dump_state ................... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] dynamic_loss_scale_args ...... None
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_enabled ........... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_gas_boundary_resolution 1
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_layer_num ......... 0
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_max_iter .......... 100
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_stability ......... 1e-06
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_tol ............... 0.01
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] eigenvalue_verbose ........... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] elasticity_enabled ........... False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] fp16_auto_cast ............... None
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] fp16_enabled ................. False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] fp16_master_weights_and_gradients False
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] global_rank .................. 0
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] grad_accum_dtype ............. bf16
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] gradient_accumulation_steps .. 8
[2023-10-17 23:13:38,674] [INFO] [config.py:972:print] gradient_clipping ............ 1.0
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] gradient_predivide_factor .... 1.0
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] initial_dynamic_scale ........ 1
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] load_universal_checkpoint .... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] loss_scale ................... 1.0
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] memory_breakdown ............. False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] mics_hierarchial_params_gather False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] mics_shard_size .............. -1
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] optimizer_legacy_fusion ...... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] optimizer_name ............... adamw
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.95], 'eps': 1e-08, 'weight_decay': 0.0001}
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] pld_enabled .................. False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] pld_params ................... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] prescale_gradients ........... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] scheduler_name ............... WarmupDecayLR
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 2e-05, 'warmup_num_steps': 1000, 'total_num_steps': 10000}
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] sparse_attention ............. None
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] sparse_gradients_enabled ..... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] steps_per_print .............. 10
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] train_batch_size ............. 40
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] train_micro_batch_size_per_gpu 1
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] use_node_local_storage ....... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] wall_clock_breakdown ......... False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] weight_quantization_config ... None
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] world_size ................... 5
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_allow_untested_optimizer False
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=10000000 allgather_partitions=True allgather_bucket_size=10000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=10000000 param_persistence_threshold=100000 model_persistence_threshold=sys.maxsize max_live_parameters=100000000 max_reuse_distance=100000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_enabled ................. True
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_force_ds_cpu_optimizer .. True
[2023-10-17 23:13:38,675] [INFO] [config.py:972:print] zero_optimization_stage ...... 3
[2023-10-17 23:13:38,676] [INFO] [config.py:958:print_user_config] json = {
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 8,
"gradient_clipping": 1.0,
"steps_per_print": 10,
"wall_clock_breakdown": false,
"memory_breakdown": false,
"optimizer": {
"type": "AdamW",
"params": {
"lr": 2e-05,
"betas": [0.9, 0.95],
"eps": 1e-08,
"weight_decay": 0.0001
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 2e-05,
"warmup_num_steps": 1000,
"total_num_steps": 1.000000e+04
}
},
"bf16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 1.000000e+07,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+07,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09,
"stage3_prefetch_bucket_size": 1.000000e+07,
"stage3_param_persistence_threshold": 1.000000e+05,
"stage3_max_live_parameters": 1.000000e+08,
"stage3_max_reuse_distance": 1.000000e+08,
"stage3_gather_16bit_weights_on_model_save": true
},
"activation_checkpointing": {
"partition_activations": true,
"number_checkpoints": 100,
"cpu_checkpointing": true,
"contiguous_memory_optimization": true,
"synchronize_checkpoint_boundary": false,
"profile": true
},
"data_types": {
"grad_accum_dtype": "bf16"
}
}
Traceback (most recent call last):
File "/data/user/llama2/src/7_train/train.py", line 449, in <module>
trainer.train_train()
File "/data/user/llama2/src/7_train/train.py", line 408, in train_train
self.train_epoch(epoch)
File "/data/user/llama2/src/7_train/train.py", line 430, in train_epoch
pred_logits = self.model_engine(input_token)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1807, in forward
loss = self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/user/llama2/src/7_train/../0_all_types_of_llama2/Llama_2_Adaptive_Lora.py", line 275, in forward
hidden = layer(hidden, start_pos, freqs_cis, mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/user/llama2/src/7_train/../0_all_types_of_llama2/Llama_2_Adaptive_Lora.py", line 230, in forward
hidden = hidden + self.feed_forward(self.ffn_norm(hidden))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/user/llama2/src/7_train/../0_all_types_of_llama2/Llama_2_Adaptive_Lora.py", line 198, in forward
hidden = self.w2(hidden)
^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/user/llama2/src/7_train/../0_all_types_of_llama2/Llama_2_Adaptive_Lora.py", line 30, in forward
hidden = self.linear(x)
^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1557, in _call_impl
args_result = hook(self, args)
^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
self.__all_gather_params(params_to_fetch, forward)
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/llama2/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1109, in all_gather_coalesced
param_buffer = torch.empty(
^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 3 has a total capacty of 15.89 GiB of which 61.88 MiB is free. Including non-PyTorch memory, this process has 15.82 GiB memory in use. Of the allocated memory 14.22 GiB is allocated by PyTorch, and 387.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-10-17 23:13:48,521] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136199
[2023-10-17 23:13:51,768] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136200
[2023-10-17 23:13:55,047] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136201
[2023-10-17 23:13:58,292] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136202
[2023-10-17 23:13:58,292] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 136203
[2023-10-17 23:14:01,610] [ERROR] [launch.py:321:sigkill_handler] ['/home/user/miniconda3/envs/llama2/bin/python', '-u', 'train.py', '--local_rank=4'] exits with return code = 1
```
I would appreciate any help.