Hi, I want to understand the process of loading a big LLM using the offload functions, such as `load_checkpoint_and_dispatch` or `dispatch_model`. I want to load only part of the model weights on the GPU for inference while keeping the other parts on CPU or disk. Is there a way to do this? A sketch of what I have in mind is below.
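For context, here is a minimal sketch of the setup I'm picturing, based on Accelerate's `init_empty_weights` + `load_checkpoint_and_dispatch` pattern. The model id, memory limits, and paths are just placeholders, and `no_split_module_classes` would need to match the actual model's block class:

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id

# Build the model skeleton without allocating any real weight memory
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Load the checkpoint and split it across devices:
# layers that fit under max_memory go to GPU 0, the next ones to CPU RAM,
# and anything left over is offloaded to disk in offload_folder.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/checkpoint",            # placeholder path to the weights
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},    # placeholder budgets per device
    offload_folder="offload",                   # disk offload directory
    no_split_module_classes=["LlamaDecoderLayer"],  # keep each block on one device
    dtype=torch.float16,
)

# Inference: inputs go to the GPU; Accelerate's hooks move offloaded
# weights onto the GPU as each layer runs
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

I also saw that `transformers`' `from_pretrained` accepts `device_map="auto"` together with `offload_folder` directly. Is that equivalent to the explicit `load_checkpoint_and_dispatch` route above, or does it behave differently?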