Unable to load a model trained via FSDP


I have trained a model using FSDP. More specifically, I used this run_clm.py script, with the option --fsdp "shard_grad_op auto_wrap".

Training went fine and the model was saved. However, when I try to load the model, I get this error:

Loading checkpoint shards: 100%|███████████████████████████████████████| 3/3 [00:00<00:00, 19.55it/s]
Traceback (most recent call last):
  File "/home/tarun/memory-llm-paper/run_experiments_pythia.py", line 49, in <module>
    model = AutoModelForCausalLM.from_pretrained('/data/users/tarun/coref/models/output32/checkpoint-345')
  File "/home/tarun/miniconda3/envs/coref-1/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/home/tarun/miniconda3/envs/coref-1/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4014, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/tarun/miniconda3/envs/coref-1/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4559, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for GPTNeoXForCausalLM:
        size mismatch for gpt_neox.embed_in.weight: copying a param with shape torch.Size([128778240]) from checkpoint, the shape in current model is torch.Size([50304, 2560]).
        size mismatch for gpt_neox.final_layer_norm.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560]).
        size mismatch for embed_out.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50304, 2560]).
        You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

Now, if we specifically look at this line:

size mismatch for gpt_neox.embed_in.weight: copying a param with shape torch.Size([128778240]) from checkpoint, the shape in current model is torch.Size([50304, 2560]).

we can see that 50304 * 2560 = 128778240. So it seems that at the end of FSDP training, the model’s parameters were saved as flattened 1-D arrays, and from_pretrained isn’t able to unflatten them automatically. The other two mismatched tensors were saved with shape torch.Size([0]), i.e. with no elements at all.
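To make the arithmetic explicit, here is a minimal sketch that classifies the three mismatches using only the shapes copied from the error message. The `classify` helper and the `mismatches` dict are hypothetical names introduced for illustration; nothing here loads the real checkpoint.

```python
from math import prod

# (shape saved in checkpoint, shape expected by the model),
# copied by hand from the RuntimeError above.
mismatches = {
    "gpt_neox.embed_in.weight": ((128778240,), (50304, 2560)),
    "gpt_neox.final_layer_norm.bias": ((0,), (2560,)),
    "embed_out.weight": ((0,), (50304, 2560)),
}


def classify(saved, expected):
    """Hypothetical helper: describe how a saved shape relates to the expected one."""
    if prod(saved) == 0:
        # The tensor was written with no elements at all.
        return "empty"
    if len(saved) == 1 and prod(saved) == prod(expected):
        # Same number of elements, but stored as a single 1-D array.
        return "flattened"
    return "other"


report = {name: classify(saved, expected) for name, (saved, expected) in mismatches.items()}
for name, kind in report.items():
    print(f"{name}: {kind}")
```

Only embed_in.weight has the right element count in flattened form; the other two tensors are empty, so reshaping alone would not be enough to recover this checkpoint.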

These are my conda environment details, in case that helps:

Conda env details

Does anyone have any idea what could be causing this? I didn’t find much by searching online.

This is about all I could find.

I couldn’t figure out the underlying reason for the bug.

But downgrading torch to the version below got me around this error:

conda install pytorch==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia

Previously, I had torch 2.3.0 and CUDA 12.4 (as can be seen in the conda environment details in the original question).
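If you want to check whether an environment matches one of the two setups reported here, a small guard along these lines might help. It is only a sketch: `KNOWN_VERSIONS` and `check_torch_version` are hypothetical names, and the table holds just the two data points from this thread, not a general compatibility list.

```python
import re

# The only two data points from this thread; not a general compatibility table.
KNOWN_VERSIONS = {
    (2, 3, 0): "checkpoint failed to load in this thread",
    (2, 2, 2): "checkpoint loaded fine in this thread",
}


def check_torch_version(version: str) -> str:
    """Hypothetical helper: map a torch version string to what was observed here."""
    # Take the leading "major.minor.patch" and ignore suffixes such as "+cu121".
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return "unrecognized version string"
    key = tuple(int(part) for part in match.groups())
    return KNOWN_VERSIONS.get(key, "not tested in this thread")


# In a real environment you would pass torch.__version__ instead of a literal.
print(check_torch_version("2.3.0+cu121"))
print(check_torch_version("2.2.2"))
```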

It seems to be resolved, but I suspected a torch bug, so I searched the torch forum, where this behavior seems to be considered normal…
Well, it seems that’s just the way it is.