I think it is working now, after:
- downgrading to transformers 4.26.1 (which does not support the `fsdp_config` argument)
- removing the `fsdp_config` argument
- adding back the `fsdp_transformer_layer_cls_to_wrap` argument
It is using less memory than in non-FSDP mode, so I think the model is actually being sharded.
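
For reference, a minimal sketch of what the working setup looks like on 4.26.1. The output dir, batch size, and the `BertLayer` class name are placeholders; substitute the transformer block class of whatever model you are wrapping:

```python
from transformers import TrainingArguments

# transformers 4.26.1: TrainingArguments has no `fsdp_config` parameter;
# the layer class to wrap is passed directly as a string instead.
training_args = TrainingArguments(
    output_dir="./fsdp-out",              # placeholder
    per_device_train_batch_size=1,
    fsdp="full_shard auto_wrap",          # shard params/grads/optimizer state, auto-wrap layers
    fsdp_transformer_layer_cls_to_wrap="BertLayer",  # placeholder: your model's block class
)
```

Note that FSDP only kicks in under a distributed launch (e.g. `torchrun --nproc_per_node=<num_gpus> train.py`); with a single process there is nothing to shard across.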