### System Info
```Shell
compute_environment: LOCAL_MACHINE …
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: "BertEmbeddings,BertLayer,BertPooler"
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: "no"
num_machines: 1
num_processes: 5
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
### Reproduction
I'm fine-tuning a text embedding model with the [sentence-transformers library](https://sbert.net/index.html). The model is [gte-base](https://huggingface.co/thenlper/gte-base/tree/main); from its config.json we can tell it is built on `BertModel`. On a toy dataset, training runs smoothly with DDP via Accelerate, since the model fits on a single GPU. When I switch to FSDP (also launched via Accelerate), different kinds of errors pop up. After several hours of exploration, I have narrowed the issue down to the `fsdp_transformer_layer_cls_to_wrap` argument. A simplified sketch of the setup is below, followed by what I have tried.
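Roughly, the training script looks like this (a minimal sketch, not my exact code; the toy anchor/positive dataset is only illustrative and assumes the sentence-transformers v3 trainer API):
```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("thenlper/gte-base")

# Toy anchor/positive pairs standing in for the real dataset.
train_dataset = Dataset.from_dict({
    "anchor": ["what does FSDP do?", "capital of France"],
    "positive": ["FSDP shards parameters across GPUs.", "Paris is the capital of France."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```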
1. Not specifying anything for the argument:
```
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 164, in forward
return F.embedding(
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/nn/functional.py", line 2267, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
```
Apparently the error comes from the `nn.Embedding` layer: FSDP flattens the parameters into a 1-D tensor, but in the forward pass the original 2-D view of the embedding weight is apparently not restored.
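As a sanity check, the same message can be reproduced outside FSDP by handing `F.embedding` a flattened weight (this just illustrates what the error means, not a claim about what FSDP does internally):
```python
import torch
import torch.nn.functional as F

weight = torch.randn(30522, 768)          # a normal 2-D embedding table
input_ids = torch.tensor([[101, 2054, 102]])

F.embedding(input_ids, weight)            # fine
F.embedding(input_ids, weight.flatten())  # RuntimeError: 'weight' must be 2-D
```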
2. Following most of the tutorials, I set `fsdp_transformer_layer_cls_to_wrap: BertLayer` and got exactly the same error.
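My (possibly wrong) understanding is that this setting gets translated into a `transformer_auto_wrap_policy`, so only `BertLayer` instances receive their own FSDP unit and everything else, embeddings and pooler included, ends up flattened into the root unit. Roughly:
```python
import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.bert.modeling_bert import BertLayer

# Roughly what `fsdp_transformer_layer_cls_to_wrap: BertLayer` turns into
# (my reading of the Accelerate FSDP plugin, so take it with a grain of salt):
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={BertLayer},
)
```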
3. So I added the embedding class to the argument: `fsdp_transformer_layer_cls_to_wrap: "BertEmbeddings,BertLayer"`
```
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 747, in forward
pooled_output = self.dense(first_token_tensor)
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 117, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat2 must be a matrix, got 1-D tensor
```
The error now comes from the BERT pooler, which is likewise not covered by the wrap policy, so I added `BertPooler` as well; a quick way to list these candidate classes is sketched right below.
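For reference, this is one way to see which top-level submodules of the underlying model hold parameters of their own (it assumes the first sentence-transformers module is the usual `Transformer` wrapper exposing `auto_model`):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-base")
bert = model[0].auto_model  # the underlying transformers BertModel

# List the top-level children, their classes, and their parameter counts.
for name, child in bert.named_children():
    n_params = sum(p.numel() for p in child.parameters())
    print(f"{name}: {type(child).__name__} ({n_params:,} params)")
# embeddings -> BertEmbeddings, encoder -> BertEncoder, pooler -> BertPooler
```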
4. `fsdp_transformer_layer_cls_to_wrap: "BertEmbeddings,BertLayer,BertPooler"`
```
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
return inner_training_loop(
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/transformers/trainer.py", line 2434, in _inner_training_loop
_grad_norm = self.accelerator.clip_grad_norm_(
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/accelerate/accelerator.py", line 2372, in clip_grad_norm_
return model.clip_grad_norm_(max_norm, norm_type)
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1111, in clip_grad_norm_
_lazy_init(self, self)
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 139, in _lazy_init
_share_state_and_init_handle_attrs(state, root_module)
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 208, in _share_state_and_init_handle_attrs
_p_assert(
File "/nethome/speng65/miniconda3/envs/meta/lib/python3.10/site-packages/torch/distributed/utils.py", line 166, in _p_assert
raise AssertionError(s)
AssertionError: Non-root FSDP instance's `_is_root` should not have been set yet or should have been set to `False`
```
This is a new error, and I cannot find useful resources on resolving it. [This issue](https://github.com/pytorch/pytorch/issues/113496) seems to be the closest one, but no solution was provided in the end.
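In case it helps with debugging, here is the probe I would use to inspect how the model actually got wrapped. It continues the training sketch above (so `trainer` is the `SentenceTransformerTrainer` defined there), and I am not certain it looks at the right thing:
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# After the Trainer has wrapped the model (e.g. from a TrainerCallback),
# list every FSDP unit, the class it wraps, and its private _is_root flag,
# which is what the assertion above complains about.
wrapped = trainer.model_wrapped  # the FSDP-wrapped model held by the Trainer
for unit in FSDP.fsdp_modules(wrapped):
    print(type(unit.module).__name__, getattr(unit, "_is_root", None))
```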
So, my questions are:
1. What should I set for this argument? When training a transformer model with FSDP, should I include every module class defined in the modeling_xxx.py file?
2. If I launch the code with `accelerate launch --config_file fsdp.yaml xxx.py`, do I have to set the FSDP-related arguments in the training arguments again, e.g., [fsdp, fsdp_config](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.fsdp)? (The sketch after this list shows what I mean.)
3. How can I resolve the error above?
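For question 2, this is the alternative I mean: configuring FSDP through `TrainingArguments` instead of (or on top of?) the Accelerate YAML. The exact keys below are my reading of the docs, not something I have verified:
```python
from transformers import TrainingArguments

# FSDP configured on the Trainer side rather than in fsdp.yaml.
# Key names come from the TrainingArguments documentation; I have not
# confirmed how this interacts with an `accelerate launch` FSDP config.
args = TrainingArguments(
    output_dir="out",
    fsdp="full_shard auto_wrap",
    fsdp_config={"transformer_layer_cls_to_wrap": ["BertLayer"]},
)
```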
### Expected behavior
See above