Accelerate not spreading on multiple CPUs

Hello there,

Accelerate version: 0.21.0
Transformers version: 4.31.0
Torch version: 2.0.1
Python version: 3.10.9
OS: Ubuntu 20.04

I tried accelerate for inference on Llama 2 with an A10 GPU and a 16-core CPU. I spread the model across devices using the device_map below (using device_map="auto" systematically ends in CUDA OOM). With this configuration, running on CPUs only is much faster than going through accelerate: the code below takes about 1 minute 30 seconds on CPUs alone and more than 2 minutes with accelerate.

Looking at CPU usage in htop, it seems accelerate only uses a single core.

Is there a way to push accelerate to use more than one CPU core?
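For reference, here is a minimal sketch of how thread counts can be pinned on the PyTorch side; I am assuming the CPU-resident layers run through torch's normal intra-op thread pool, and the count of 16 simply matches my machine:

import os
# Environment variables have to be set before torch is imported.
os.environ["OMP_NUM_THREADS"] = "16"

import torch
# Size of the intra-op thread pool used by CPU kernels (e.g. matmuls).
torch.set_num_threads(16)
print(torch.get_num_threads())  # sanity check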

prompt = """
Translate the following text in French: ```My father was a good person```
"""

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generate_ids = model.generate(inputs.input_ids, max_length=64)
tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
The device_map I use (embed_tokens and the first 14 layers on GPU 0, everything else on CPU):

device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 0,
    "model.layers.11": 0,
    "model.layers.12": 0,
    "model.layers.13": 0,
    "model.layers.14": "cpu",
    "model.layers.15": "cpu",
    "model.layers.16": "cpu",
    "model.layers.17.self_attn.q_proj": "cpu",
    "model.layers.17.self_attn.k_proj": "cpu",
    "model.layers.17.self_attn.v_proj": "cpu",
    "model.layers.17.self_attn.o_proj": "cpu",
    "model.layers.17.self_attn.rotary_emb": "cpu",
    "model.layers.17.mlp": "cpu",
    "model.layers.17.input_layernorm": "cpu",
    "model.layers.17.post_attention_layernorm": "cpu",
    "model.layers.18": "cpu",
    "model.layers.19": "cpu",
    "model.layers.20": "cpu",
    "model.layers.21": "cpu",
    "model.layers.22": "cpu",
    "model.layers.23": "cpu",
    "model.layers.24": "cpu",
    "model.layers.25": "cpu",
    "model.layers.26": "cpu",
    "model.layers.27": "cpu",
    "model.layers.28": "cpu",
    "model.layers.29": "cpu",
    "model.layers.30": "cpu",
    "model.layers.31": "cpu",
    "model.layers.32": "cpu",
    "model.layers.33": "cpu",
    "model.layers.34": "cpu",
    "model.layers.35": "cpu",
    "model.layers.36": "cpu",
    "model.layers.37": "cpu",
    "model.layers.38": "cpu",
    "model.layers.39": "cpu",
    "model.norm": "cpu",
    "lm_head": "cpu",
}
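As an aside, a split like this can also be derived automatically instead of written by hand, by capping how much memory accelerate may put on the GPU. A minimal sketch, assuming a Llama checkpoint (the max_memory values are illustrative, not tuned for an A10):

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(checkpoint)
# Build the model skeleton without allocating any real weights.
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Capping GPU 0 below its capacity pushes the tail layers onto "cpu".
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "20GiB", "cpu": "60GiB"},
    no_split_module_classes=["LlamaDecoderLayer"],
)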

While I am here, I also get the following error message:

ValueError: ... containing more than one `.index.json` file, delete the irrelevant ones.

It seems the checkpoint folder contains two index files: pytorch_model.bin.index.json and model.safetensors.index.json. I have to delete the second one manually every time I reload the model. Is there a way to pass the index file explicitly, or to force the PyTorch weights, so I don't have to do this by hand?
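If I read the accelerate docs correctly, checkpoint can also point at a specific index file rather than at the folder, which should sidestep the ambiguity entirely; a sketch of what I mean (untested):

import os

from accelerate import load_checkpoint_and_dispatch

# Point directly at the PyTorch index file so the loader never sees
# the competing model.safetensors.index.json in the same folder.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=os.path.join(weights_location, "pytorch_model.bin.index.json"),
    device_map=device_map,
    no_split_module_classes=["Block"],
)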

Thank you for your help.

Partially answering my own question.

Changing how the model is loaded improved execution speed tremendously.

Instead of using load_checkpoint_and_dispatch, as shown here:

from accelerate import load_checkpoint_and_dispatch

# Note: as far as I can tell, no_split_module_classes only matters when
# the device_map is inferred (e.g. "auto"); with an explicit dict it is unused.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=weights_location,
    device_map=device_map,
    no_split_module_classes=["Block"],
)

loading directly with from_pretrained:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    use_auth_token=True,
    # my actual device_map is a dict of maps keyed by checkpoint name
    device_map=device_map[checkpoint],
)

makes the model much faster. I am not sure why, though, given that from_pretrained with a device_map relies on accelerate under the hood as well.