How to train a >100GB model with the Hugging Face Trainer

Hi, I want to train a model that is >100GB in size, and it OOMs when I load it with from_pretrained. What do you suggest for loading and saving 100GB models? For the training itself, I can use FSDP to distribute the weights across devices, but I am stuck on model loading and saving.

I found this article, but it only supports inference.


@maxBing12345 did you find a solution?

I'm not sure if this can be done because I've never tried it, but can you push the model to the Hub? Then you can just load and save it from there, as in the sketch below.
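
Something like this might work (a rough sketch; the model name and repo id are placeholders):

```python
# Rough sketch, untested at 100GB scale: push the checkpoint to the Hub once,
# then load/save through the Hub repo instead of a local path.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder

# Push the (sharded) checkpoint to your Hub repo
model.push_to_hub("your-username/your-100gb-model")  # placeholder repo id

# Later, load it back from the Hub
model = AutoModelForCausalLM.from_pretrained("your-username/your-100gb-model")
```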


For loading big models, you can take a look at this repo: GitHub - huggingface/peft: 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
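
For example, a minimal PEFT/LoRA sketch (the base model name and hyperparameters are placeholders, and target_modules depends on the architecture):

```python
# Sketch: with PEFT/LoRA only a small set of adapter weights is trained and
# saved, which avoids writing out the full 100GB checkpoint at every save.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder

lora_config = LoraConfig(
    r=16,                                  # adapter rank (placeholder)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # depends on the model architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Saving now only writes the small adapter files
model.save_pretrained("./my-adapter")
```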

For saving, you can use save_pretrained with push_to_hub=True to push to the HF Hub. You can also set max_shard_size to split a big model into smaller checkpoint files.
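
For instance (a minimal sketch; the repo name and shard size are placeholders):

```python
# Sketch: shard the checkpoint into smaller files and push it to the Hub.
model.save_pretrained(
    "my-100gb-model",      # placeholder; also used as the Hub repo name
    push_to_hub=True,
    max_shard_size="5GB",  # each shard file stays under ~5GB
)
```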
