Say we would like to use the Transformers+DeepSpeed integration to fine-tune a relatively large model. That model is too big to fit both the parameters and the full optimizer states in GPU memory at once, so instead we want to freeze most of the parameters and fine-tune a subset of them, or alternatively to tune an adapter that wraps the model. That way we avoid needing to store Adam buffers for the frozen ones. Additionally, we want to use DeepSpeed for ZeRO-Offload.
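Here is a minimal sketch of the kind of freezing I mean; the checkpoint name and the `score` head prefix are just placeholders (in practice the model would be much larger, or the trainable part would be an adapter instead):

```python
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint; in practice this would be the large model we
# actually want to fine-tune.
model = AutoModelForSequenceClassification.from_pretrained("gpt2")

# Freeze everything except the classification head. An adapter would be
# handled the same way: only its parameters keep requires_grad=True.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("score")

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")
```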
To do that, I believe we need to pass our own optimizer to Trainer manually (otherwise Trainer will create an optimizer over all of the parameters, which is exactly what we want to avoid). But it looks like the DeepSpeed config can also specify its own optimizer. If we pass a custom optimizer to Trainer along with the DeepSpeed config, will training correctly touch only the subset of parameters covered by that optimizer, or will a second optimizer be created on the DeepSpeed side that tries to optimize the whole model?
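For concreteness, this is roughly what I have in mind, continuing from the snippet above (the config path, hyperparameters, and `train_dataset` are placeholders, and I'm only assuming that the `optimizers=(optimizer, scheduler)` argument is the intended hook for a custom optimizer):

```python
from torch.optim import AdamW
from transformers import Trainer, TrainingArguments

# Optimizer over the trainable subset only, so no Adam state is allocated
# for the frozen parameters.
optimizer = AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=5e-5,
)

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    deepspeed="ds_config_zero2_offload.json",  # placeholder ZeRO-Offload config; presumably with no `optimizer` section?
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # placeholder dataset, defined elsewhere
    optimizers=(optimizer, None),  # None: let Trainer create the LR scheduler
)
trainer.train()
```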