Hello,
I'm trying to launch a multi-node/multi-GPU training run for a large model with `accelerate`, using the DeepSpeed plugin (no DS config file), 8-bit Adam from bitsandbytes, and a cosine-annealing LR scheduler. However, DeepSpeed doesn't seem to use the 8-bit Adam optimizer I create in my Python script; it falls back to regular AdamW instead, even though the documentation seems to indicate that custom optimizers/schedulers should work in this setup. Any idea what's happening here? Is there some specific setup required for this?
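For context, here is roughly what my script does (a simplified sketch; the model and hyperparameter values are placeholders, not my actual ones):

```python
import bitsandbytes as bnb
from accelerate import Accelerator
from transformers import get_cosine_schedule_with_warmup

# The DeepSpeed plugin settings come from `accelerate config` /
# the launcher, not from a DS JSON config file.
accelerator = Accelerator()

model = ...  # placeholder for the large model

# 8-bit AdamW from bitsandbytes, created in the script
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

# Cosine annealing with warmup (placeholder step counts)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

# Expectation: accelerate/DeepSpeed should use this custom
# optimizer/scheduler pair, but AdamW appears to be used instead.
model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)
```

The run is launched with `accelerate launch` on each node.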
Thanks!