I am loading LLaMA-65B for inference with device_map="auto". Is there a way to check which layers are actually offloaded? Also, is there a way to specify which parts of the model to offload? I am not using DeepSpeed, since I am on an ARM64 machine (GH200) and DeepSpeed doesn't support ARM yet.
I am loading the model like this:
model = AutoModelForCausalLM.from_pretrained("/models/LLAMA-HF/llama-65b-hf/", device_map="auto")
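For context, what I'm hoping for is something like the sketch below. My understanding (an assumption based on the Accelerate docs, not something I've verified on this machine) is that after loading with device_map="auto", the final placement is recorded as a plain dict on `model.hf_device_map`, and that the same dict format can be passed back via `device_map=` to pin modules manually. The module names and devices here are made-up illustrations:

```python
# After: model = AutoModelForCausalLM.from_pretrained(..., device_map="auto")
# Accelerate (assumption) exposes the resolved placement as a dict:
#   print(model.hf_device_map)
# which maps module names to devices, something like:
device_map = {
    "model.embed_tokens": 0,      # GPU 0
    "model.layers.0": 0,          # GPU 0
    "model.layers.1": "cpu",      # offloaded to CPU RAM
    "model.layers.2": "disk",     # offloaded to disk
    "lm_head": 0,
}

# Entries on "cpu" or "disk" are the offloaded ones:
offloaded = [name for name, dev in device_map.items() if dev in ("cpu", "disk")]
print(offloaded)  # ['model.layers.1', 'model.layers.2']

# Presumably I could then hand-edit this dict and pass it back:
#   model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map)
```

Is that roughly the supported way to do it, or is there a better mechanism (e.g. `max_memory`) for steering what gets offloaded?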