I am training a fairly large multimodal model on a very large dataset; the model takes the embeddings of two other large pretrained models as inputs.
The cleanest solution is obviously to treat everything as one large module. But I am worried about GPU memory even with a sharded model, since I only have easy access to 4 H100s. So I am wondering whether there is a way to temporarily offload the model being trained, load the pretrained ones, and compute the embeddings efficiently, potentially doing this at the scale of multiple batches instead of every batch. A rough sketch of what I'm picturing is below.
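To make it concrete, here is roughly the kind of swap I have in mind. All the modules and the loader below are toy placeholders, and it assumes plain `.to()` device moves rather than an FSDP-sharded model, so take it as a sketch of the idea rather than anything real:

```python
import torch
import torch.nn as nn

# Toy stand-ins; in reality these would be the two frozen pretrained
# encoders and the multimodal model being trained.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder_a = nn.Linear(128, 64).eval().requires_grad_(False)
encoder_b = nn.Linear(128, 64).eval().requires_grad_(False)
model = nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def embed_chunk(batches):
    """Swap the trainable model off the GPU, bring the frozen encoders on,
    and cache embeddings for several batches on the CPU."""
    model.to("cpu")  # note: optimizer state would likely also need offloading in practice
    encoder_a.to(device); encoder_b.to(device)
    cached = []
    with torch.no_grad():
        for xa, xb, y in batches:
            ea = encoder_a(xa.to(device)).cpu()
            eb = encoder_b(xb.to(device)).cpu()
            cached.append((ea, eb, y))
    encoder_a.to("cpu"); encoder_b.to("cpu")
    torch.cuda.empty_cache()  # release the allocator blocks the encoders used
    return cached

def train_chunk(cached):
    """Bring the trainable model back onto the GPU and step on the cached embeddings."""
    model.to(device)
    for ea, eb, y in cached:
        optimizer.zero_grad(set_to_none=True)
        out = model(torch.cat([ea, eb], dim=-1).to(device))
        loss = nn.functional.cross_entropy(out, y.to(device))
        loss.backward()
        optimizer.step()

# Fake loader yielding "chunks" of several batches at a time, so the
# model swap only happens once per chunk rather than once per batch.
def chunks(n_chunks=2, batches_per_chunk=4, bs=8):
    for _ in range(n_chunks):
        yield [(torch.randn(bs, 128), torch.randn(bs, 128),
                torch.randint(0, 10, (bs,))) for _ in range(batches_per_chunk)]

for chunk in chunks():
    train_chunk(embed_chunk(chunk))
```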
How much of a nightmare would this be, and if it is doable, are there any recommendations for how to approach it?
Thanks,
Evan