Hi, I want to understand the process of loading a big LLM using the offload functions, such as `load_checkpoint_and_dispatch` or `dispatch_model`. I want to load only part of the model weights on the GPU for inference while keeping the other parts on CPU or disk. Is there a way to do this? A sketch of what I have in mind is below.
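For context, here is a minimal sketch of the setup I'm picturing, based on Accelerate's `init_empty_weights` + `load_checkpoint_and_dispatch` pattern. The model id, memory limits, and paths are just placeholders, and `no_split_module_classes` would need to match the actual model's block class:

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id

# Build the model skeleton without allocating any real weight memory
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Load the checkpoint and split it across devices:
# layers that fit under max_memory go to GPU 0, the next ones to CPU RAM,
# and anything left over is offloaded to disk in offload_folder.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/checkpoint",            # placeholder path to the weights
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},    # placeholder budgets per device
    offload_folder="offload",                   # disk offload directory
    no_split_module_classes=["LlamaDecoderLayer"],  # keep each block on one device
    dtype=torch.float16,
)

# Inference: inputs go to the GPU; Accelerate's hooks move offloaded
# weights onto the GPU as each layer runs
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

I also saw that `transformers`' `from_pretrained` accepts `device_map="auto"` together with `offload_folder` directly. Is that equivalent to the explicit `load_checkpoint_and_dispatch` route above, or does it behave differently?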