Hello there,
Accelerate version: 0.21.0
Transformers version: 4.31.0
Torch version: 2.0.1
Python version: 3.10.9
OS: Ubuntu 20.04
I tried accelerate for inference on llama2 with an A10 GPU and a 16-core CPU. I spread llama across devices using the device_map below (using device_map="auto" systematically ends in CUDA OOM). With this configuration, using only CPUs is much faster than using accelerate: the code below runs in about 1 min 30 s on CPU alone and in more than two minutes with accelerate.
Looking at CPU usage with a simple htop, it seems accelerate uses only one CPU core. Is there a way to push accelerate towards using more than one CPU?
prompt = """
Translate the following text in French: ```My father was a good person```
"""
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
generate_ids = model.generate(inputs.input_ids, max_length=64)
tokenizer.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
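In case it helps, this is how I pin the thread count in the pure-CPU run (a minimal sketch using PyTorch's intra-op threading controls; I am assuming, perhaps wrongly, that the layers accelerate offloads to CPU go through the same thread pool):

```python
import os
import torch

# PyTorch's intra-op thread pool is what parallelises CPU matmuls;
# check what it defaults to in this process.
print("threads before:", torch.get_num_threads())

# Ask PyTorch to use every available core for intra-op work.
torch.set_num_threads(os.cpu_count() or 1)
print("threads after:", torch.get_num_threads())
```

This has a visible effect on the CPU-only run, but I could not confirm it changes anything on the accelerate path.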
device_map = {
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 0,
"model.layers.2": 0,
"model.layers.3": 0,
"model.layers.4": 0,
"model.layers.5": 0,
"model.layers.6": 0,
"model.layers.7": 0,
"model.layers.8": 0,
"model.layers.9": 0,
"model.layers.10": 0,
"model.layers.11": 0,
"model.layers.12": 0,
"model.layers.13": 0,
"model.layers.14": "cpu",
"model.layers.15": "cpu",
"model.layers.16": "cpu",
"model.layers.17.self_attn.q_proj": "cpu",
"model.layers.17.self_attn.k_proj": "cpu",
"model.layers.17.self_attn.v_proj": "cpu",
"model.layers.17.self_attn.o_proj": "cpu",
"model.layers.17.self_attn.rotary_emb": "cpu",
"model.layers.17.mlp": "cpu",
"model.layers.17.input_layernorm": "cpu",
"model.layers.17.post_attention_layernorm": "cpu",
"model.layers.18": "cpu",
"model.layers.19": "cpu",
"model.layers.20": "cpu",
"model.layers.21": "cpu",
"model.layers.22": "cpu",
"model.layers.23": "cpu",
"model.layers.24": "cpu",
"model.layers.25": "cpu",
"model.layers.26": "cpu",
"model.layers.27": "cpu",
"model.layers.28": "cpu",
"model.layers.29": "cpu",
"model.layers.30": "cpu",
"model.layers.31": "cpu",
"model.layers.32": "cpu",
"model.layers.33": "cpu",
"model.layers.34": "cpu",
"model.layers.35": "cpu",
"model.layers.36": "cpu",
"model.layers.37": "cpu",
"model.layers.38": "cpu",
"model.layers.39": "cpu",
"model.norm": "cpu",
"lm_head": "cpu",
}
While I am here, I also get the following error message:
ValueError ... containing more than one `.index.json` file, delete the irrelevant ones.
It seems the checkpoint folder has two index files, pytorch_model.bin.index.json and model.safetensors.index.json, and I have to delete the second one manually every time I reload the model. Is there a way to specify the index file as an argument, or to force using the PyTorch weights, so I don't have to do this manual work?
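What I am hoping for is something like the following (a sketch; I am assuming the use_safetensors flag of from_pretrained does what I want and skips the safetensors index entirely):

```python
from transformers import AutoModelForCausalLM

def load_pytorch_weights(model_path, device_map):
    """Load a checkpoint while ignoring model.safetensors.index.json."""
    # use_safetensors=False should make transformers read the
    # pytorch_model.bin shards via pytorch_model.bin.index.json.
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map=device_map,
        use_safetensors=False,
    )
```

If that flag is not the intended mechanism, a pointer to the right one would be appreciated.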
Thank you for your help.