Hmmm, I seem to have forgotten to post my output… Smart me.
[2024-11-29 14:33:06,356] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-29 14:33:09,315] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2024-11-29 14:33:09,350] [INFO] [runner.py:555:main] cmd = /home/NotEnoughVRAM /.conda/envs/LLM_Trainer/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None gpt_neo_2B7_finetune.py
[2024-11-29 14:33:10,819] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-29 14:33:11,739] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-11-29 14:33:11,739] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-11-29 14:33:11,739] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-11-29 14:33:11,739] [INFO] [launch.py:163:main] dist_world_size=2
[2024-11-29 14:33:11,739] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-11-29 14:33:14,109] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-29 14:33:14,109] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
NVIDIA A100-SXM4-40GB
NVIDIA A100-SXM4-40GB
NVIDIA A100-SXM4-40GB
NVIDIA A100-SXM4-40GB
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Using pad_token, but it is not set yet.
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/data/datasets/language_modeling.py:53: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
warnings.warn(
Using pad_token, but it is not set yet.
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/data/datasets/language_modeling.py:53: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
warnings.warn(
Number of samples in the dataset: 3256
[2024-11-29 14:33:45,099] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-11-29 14:33:45,100] [INFO] [comm.py:594:init_distributed] cdb=None
Number of samples in the dataset: 3256
[2024-11-29 14:33:45,167] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-11-29 14:33:45,167] [INFO] [comm.py:594:init_distributed] cdb=None
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /gpfs/home1/NotEnoughVRAM /.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Using /gpfs/home1/NotEnoughVRAM /.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Emitting ninja build file /gpfs/home1/NotEnoughVRAM /.cache/torch_extensions/py38_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7574810981750488 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7965133190155029 seconds
Parameter Offload: Total persistent parameters: 824320 in 226 params
0%|          | 0/406 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905975447/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905975447/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Parameter module.transformer.wpe.weight has None gradient
Parameter module.transformer.h.0.ln_1.weight has None gradient
Parameter module.transformer.h.0.ln_1.bias has None gradient
Parameter module.transformer.h.0.attn.attention.k_proj.weight has None gradient
Parameter module.transformer.h.0.attn.attention.v_proj.weight has None gradient
Parameter module.transformer.h.0.attn.attention.q_proj.weight has None gradient
Parameter module.transformer.h.0.attn.attention.out_proj.weight has None gradient
Parameter module.transformer.h.0.attn.attention.out_proj.bias has None gradient
Parameter module.transformer.h.0.ln_2.weight has None gradient
Parameter module.transformer.h.0.ln_2.bias has None gradient
Parameter module.transformer.wpe.weight has None gradient
<SNIP due to char limit of discourse>
Parameter module.transformer.h.31.mlp.c_proj.bias has None gradient
Parameter module.transformer.ln_f.weight has None gradient
Parameter module.transformer.ln_f.bias has None gradient
0%|          | 1/406 [00:57<6:25:55, 57.17s/it]
[rank1]: Traceback (most recent call last):
[rank1]: File "gpt_neo_2B7_finetune.py", line 130, in <module>
[rank1]: trainer.train()
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
[rank1]: return inner_training_loop(
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: File "gpt_neo_2B7_finetune.py", line 81, in training_step
[rank1]: accelerator.backward(loss)
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/accelerate/accelerator.py", line 1316, in backward
[rank1]: loss.backward(**kwargs)
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/function.py", line 301, in apply
[rank1]: return user_fn(self, *args)
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 320, in backward
[rank1]: torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1006, in reduce_partition_and_remove_grads
[rank1]: self.reduce_ready_partitions_and_remove_grads(param, i)
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1286, in reduce_ready_partitions_and_remove_grads
[rank1]: self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1041, in reduce_independent_p_g_buckets_and_remove_grads
[rank1]: self.__reduce_and_partition_ipg_grads()
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1069, in __reduce_and_partition_ipg_grads
[rank1]: if param.grad.numel() != param.ds_numel:
[rank1]: AttributeError: 'NoneType' object has no attribute 'numel'
[rank0]: Traceback (most recent call last):
[rank0]: File "gpt_neo_2B7_finetune.py", line 130, in <module>
[rank0]: trainer.train()
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: File "gpt_neo_2B7_finetune.py", line 81, in training_step
[rank0]: accelerator.backward(loss)
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/accelerate/accelerator.py", line 1316, in backward
[rank0]: loss.backward(**kwargs)
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/function.py", line 301, in apply
[rank0]: return user_fn(self, *args)
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 320, in backward
[rank0]: torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1006, in reduce_partition_and_remove_grads
[rank0]: self.reduce_ready_partitions_and_remove_grads(param, i)
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1286, in reduce_ready_partitions_and_remove_grads
[rank0]: self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1041, in reduce_independent_p_g_buckets_and_remove_grads
[rank0]: self.__reduce_and_partition_ipg_grads()
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/NotEnoughVRAM /.conda/envs/LLM_Trainer/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1069, in __reduce_and_partition_ipg_grads
[rank0]: if param.grad.numel() != param.ds_numel:
[rank0]: AttributeError: 'NoneType' object has no attribute 'numel'
That last error is due to the gradient being `NoneType`…
But as you can see, it thinks every parameter in the model has a None gradient.
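For reference, those "Parameter … has None gradient" lines come from a debug check in my script, roughly along these lines (a simplified sketch, not the exact code from gpt_neo_2B7_finetune.py):

```python
def report_none_grads(model):
    """Print every trainable parameter whose .grad is still None after backward()."""
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is None:
            print(f"Parameter {name} has None gradient")
```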
So I checked the library versions:
Python 3.8.13
torch version: 2.3.1
transformers version: 4.28.1
accelerate version: 0.15.0
deepspeed version: 0.9.5
I do not have pytorch-lightning installed.
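(For reference, I collected these with a quick snippet along these lines; the exact code I ran may have differed slightly.)

```python
import platform

import accelerate
import deepspeed
import torch
import transformers

# Quick dump of the relevant library versions.
print("Python", platform.python_version())
print("torch version:", torch.__version__)
print("transformers version:", transformers.__version__)
print("accelerate version:", accelerate.__version__)
print("deepspeed version:", deepspeed.__version__)
```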
I will try another model; maybe this one just clashes with that combination of library versions.
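Concretely, I mean swapping the checkpoint in the loading code for a smaller one, something like this (the 1.3B model name is just an illustrative choice, not necessarily what I'll settle on):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same loading code as before, different checkpoint.
model_name = "EleutherAI/gpt-neo-1.3B"  # illustrative replacement for gpt-neo-2.7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```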