PEFT LoRA GPT-NeoX - Backward pass failing

I have written a training script that uses the Accelerate and PEFT libraries to finetune GPT-NeoX, and I repeatedly encounter the following two messages, which result in a runtime error.

The first message is:

/opt/conda/envs/accelerate/lib/python3.7/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

and the second is:

File "/opt/conda/envs/accelerate/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I use the following code excerpt to load the model.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value", "xxx"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(model_name)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

The terminal command I am executing is:

accelerate launch train.py --data_path_file ./prompts.jsonl -m EleutherAI/gpt-neox-20b -te 3 -lr 1.41e-5 --eval_size 0.1 --batch_size 7 --gradient_checkpointing False

Any tips on successfully backpropagating using LoRA would be appreciated!

Environment details

(accelerate) root@de1305f1fa1f:/mnt/training# python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.10

Python version: 3.7.3 (default, Mar 27 2019, 22:11:17)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-debian-bullseye-sid
Is CUDA available: True
CUDA runtime version: 11.2.152
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: GRID A100D-7-80C
  MIG 7g.80gb     Device  0:

Nvidia driver version: 525.85.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.13.1
[conda] numpy                     1.21.6                   pypi_0    pypi
[conda] torch                     1.13.1                   pypi_0    pypi

Hi @eusip!
Can you share the entire training script with us? I suspect gradient checkpointing is being enabled under the hood for some reason (even though the gradient_checkpointing flag is set to False).
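
One common way this happens (just a guess, since I haven't seen train.py): if the script parses the flag with argparse using type=bool, then --gradient_checkpointing False actually evaluates to True, because bool() of any non-empty string is True. A quick sketch of the pitfall:

import argparse

parser = argparse.ArgumentParser()
# Pitfall: type=bool calls bool("False"), and any non-empty string is truthy
parser.add_argument("--gradient_checkpointing", type=bool, default=False)

args = parser.parse_args(["--gradient_checkpointing", "False"])
print(args.gradient_checkpointing)  # True, not False!

# Safer alternative: an on/off switch that stays False unless the flag is passed
fixed_parser = argparse.ArgumentParser()
fixed_parser.add_argument("--gradient_checkpointing", action="store_true")

If that is the cause, an action="store_true" flag (or an explicit string-to-bool converter) avoids the surprise.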
The error you are seeing is due to the fact that the inputs do not have requires_grad set to True. To fix that, you might need to call:

if hasattr(model, "enable_input_require_grads"):
    model.enable_input_require_grads()
else:
    # Fallback for older transformers versions: force the embedding outputs
    # to require grad so the checkpointed segments stay connected to the graph.
    def make_inputs_require_grad(module, input, output):
        output.requires_grad_(True)

    model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

Put this somewhere in your training script, right before the call to get_peft_model.
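
To be concrete, here is a minimal sketch of the ordering I mean, reusing model_name and peft_config from your snippet (enable_input_require_grads is available on recent transformers versions; on older ones, fall back to the hook above):

model = AutoModelForCausalLM.from_pretrained(model_name)

# Force the embedding outputs to require grad before wrapping with PEFT,
# so gradient checkpointing has an input connected to the autograd graph.
model.enable_input_require_grads()

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()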
Let me know if this works.


Thanks for the prompt response, @ybelkada!

Your code snippet did the trick! For general reference, my updated training script can be found here.


For future users: I had the same error messages, but the posted solution didn't work for me. In my case, the problem was caused by these lines:

from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

The fix was to remove them.
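
As far as I understand, prepare_model_for_kbit_training is meant for models loaded in 8-bit or 4-bit: it sets requires_grad=False on every base-model parameter and, by default, also enables gradient checkpointing, which may be why it reintroduces the error in some non-quantized setups. A rough sketch of the usage it is intended for (assuming bitsandbytes is installed, and reusing model_name and peft_config from above):

from transformers import AutoModelForCausalLM
from peft import get_peft_model, prepare_model_for_kbit_training

# prepare_model_for_kbit_training only makes sense on a quantized model
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)
model = prepare_model_for_kbit_training(model)  # freezes base weights, upcasts norms to fp32
model = get_peft_model(model, peft_config)      # adds the trainable LoRA weights on top

If you are training in full or half precision, drop the prepare_model_for_kbit_training call entirely, as above.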

You saved my day.
Thank you.

I think the comments in this link may be helpful.

Same here; removing those lines fixed it for me.

Thanks, I am getting the same error, but this solution does not work for me.