Accelerate not spreading on multiple CPUs

Hello there,

Accelerate version: 0.21.0
Transformers version: 4.31.0
Torch version: 2.0.1
Python version: 3.10.9
OS: Ubuntu 20.04

I tried accelerate for inference on Llama 2 with an A10 GPU and a 16-core CPU. I spread the model across devices using the device_map below (using device_map="auto" systematically ends in CUDA OOM). With this configuration, running on CPUs only is much faster than going through accelerate: the code below takes about 1 minute 30 seconds on CPUs alone and more than 2 minutes with accelerate.

Looking at CPU usage in htop, it seems accelerate only uses a single core.

Is there a way to push accelerate to use more than one CPU core?
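For reference, here is a minimal sketch of how thread counts can be pinned on the PyTorch side; I am assuming the CPU-resident layers run through torch's normal intra-op thread pool, and the count of 16 simply matches my machine:

import os
# Environment variables have to be set before torch is imported.
os.environ["OMP_NUM_THREADS"] = "16"

import torch
# Size of the intra-op thread pool used by CPU kernels (e.g. matmuls).
torch.set_num_threads(16)
print(torch.get_num_threads())  # sanity check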

prompt = """
Translate the following text in French: ```My father was a good person```
"""

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generate_ids = model.generate(inputs.input_ids, max_length=64)
tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
The device_map I use (embed_tokens and the first 14 layers on GPU 0, everything else on CPU):

device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 0,
    "model.layers.11": 0,
    "model.layers.12": 0,
    "model.layers.13": 0,
    "model.layers.14": "cpu",
    "model.layers.15": "cpu",
    "model.layers.16": "cpu",
    "model.layers.17.self_attn.q_proj": "cpu",
    "model.layers.17.self_attn.k_proj": "cpu",
    "model.layers.17.self_attn.v_proj": "cpu",
    "model.layers.17.self_attn.o_proj": "cpu",
    "model.layers.17.self_attn.rotary_emb": "cpu",
    "model.layers.17.mlp": "cpu",
    "model.layers.17.input_layernorm": "cpu",
    "model.layers.17.post_attention_layernorm": "cpu",
    "model.layers.18": "cpu",
    "model.layers.19": "cpu",
    "model.layers.20": "cpu",
    "model.layers.21": "cpu",
    "model.layers.22": "cpu",
    "model.layers.23": "cpu",
    "model.layers.24": "cpu",
    "model.layers.25": "cpu",
    "model.layers.26": "cpu",
    "model.layers.27": "cpu",
    "model.layers.28": "cpu",
    "model.layers.29": "cpu",
    "model.layers.30": "cpu",
    "model.layers.31": "cpu",
    "model.layers.32": "cpu",
    "model.layers.33": "cpu",
    "model.layers.34": "cpu",
    "model.layers.35": "cpu",
    "model.layers.36": "cpu",
    "model.layers.37": "cpu",
    "model.layers.38": "cpu",
    "model.layers.39": "cpu",
    "model.norm": "cpu",
    "lm_head": "cpu",
}
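As an aside, a split like this can also be derived automatically instead of written by hand, by capping how much memory accelerate may put on the GPU. A minimal sketch, assuming a Llama checkpoint (the max_memory values are illustrative, not tuned for an A10):

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(checkpoint)
# Build the model skeleton without allocating any real weights.
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Capping GPU 0 below its capacity pushes the tail layers onto "cpu".
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "20GiB", "cpu": "60GiB"},
    no_split_module_classes=["LlamaDecoderLayer"],
)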

While I am here, I also get the following error message:

ValueError: ... containing more than one `.index.json` file, delete the irrelevant ones.

It seems the checkpoint folder contains two index files: pytorch_model.bin.index.json and model.safetensors.index.json. I have to delete the second one manually every time I reload the model. Is there a way to pass the index file explicitly, or to force the PyTorch weights, so I don't have to do this by hand?
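If I read the accelerate docs correctly, checkpoint can also point at a specific index file rather than at the folder, which should sidestep the ambiguity entirely; a sketch of what I mean (untested):

import os

from accelerate import load_checkpoint_and_dispatch

# Point directly at the PyTorch index file so the loader never sees
# the competing model.safetensors.index.json in the same folder.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=os.path.join(weights_location, "pytorch_model.bin.index.json"),
    device_map=device_map,
    no_split_module_classes=["Block"],
)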

Thank you for your help.

Partially answering my own question.

Changing how the model is loaded improved execution speed tremendously.

Instead of using load_checkpoint_and_dispatch, as shown here:

from accelerate import load_checkpoint_and_dispatch

# Note: as far as I can tell, no_split_module_classes only matters when
# the device_map is inferred (e.g. "auto"); with an explicit dict it is unused.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=weights_location,
    device_map=device_map,
    no_split_module_classes=["Block"],
)

loading directly with from_pretrained:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    use_auth_token=True,
    # my actual device_map is a dict of maps keyed by checkpoint name
    device_map=device_map[checkpoint],
)

makes the model much faster. I am not sure why, though, given that from_pretrained with a device_map relies on accelerate under the hood as well.