I am loading LLaMA-65B for inference with device_map="auto". Is there a way to check which layers are actually offloaded? Also, is there a way to specify which parts of the model to offload? I am not using DeepSpeed, since I am on an ARM64 machine (GH200) and DeepSpeed doesn't support ARM yet.
I am loading the model like this:
model = AutoModelForCausalLM.from_pretrained("/models/LLAMA-HF/llama-65b-hf/", device_map="auto")
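For context, what I'm hoping for is something like the sketch below. My understanding (an assumption based on the Accelerate docs, not something I've verified on this machine) is that after loading with device_map="auto", the final placement is recorded as a plain dict on `model.hf_device_map`, and that the same dict format can be passed back via `device_map=` to pin modules manually. The module names and devices here are made-up illustrations:

```python
# After: model = AutoModelForCausalLM.from_pretrained(..., device_map="auto")
# Accelerate (assumption) exposes the resolved placement as a dict:
#   print(model.hf_device_map)
# which maps module names to devices, something like:
device_map = {
    "model.embed_tokens": 0,      # GPU 0
    "model.layers.0": 0,          # GPU 0
    "model.layers.1": "cpu",      # offloaded to CPU RAM
    "model.layers.2": "disk",     # offloaded to disk
    "lm_head": 0,
}

# Entries on "cpu" or "disk" are the offloaded ones:
offloaded = [name for name, dev in device_map.items() if dev in ("cpu", "disk")]
print(offloaded)  # ['model.layers.1', 'model.layers.2']

# Presumably I could then hand-edit this dict and pass it back:
#   model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map)
```

Is that roughly the supported way to do it, or is there a better mechanism (e.g. `max_memory`) for steering what gets offloaded?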