Llama 2 70B on a CPU

I was just using this model here on Hugging Face and noticed that it referenced a CPU, which I didn’t expect at all. Here’s the link:

Beside the title it says:

“Running on cpu. Upgrade.”

Is this just the endpoint running on a CPU?

No, it’s running with Inference Endpoints, which are probably backed by several powerful GPUs (A100s). Running a 70B model on a CPU would be extremely slow and would take over 100 GB of RAM.
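To put a rough number on the RAM estimate, here is a back-of-the-envelope sketch. It assumes exactly 70e9 parameters and nominal bytes per weight (2 for fp16, 1 for int8, 0.5 for 4-bit), and it counts weights only. The KV cache, activations, and quantization block overhead are ignored, so real usage will be somewhat higher.

```python
# Rough weight-only memory estimate for a 70B-parameter model.
# Assumptions (not from the thread): exactly 70e9 parameters, nominal
# bytes per weight, and no KV cache, activations, or quantization overhead.
PARAMS = 70e9

BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(params: float, bytes_per_weight: float) -> float:
    """Return the weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_weight / 1e9


estimates = {
    name: weight_memory_gb(PARAMS, nbytes)
    for name, nbytes in BYTES_PER_WEIGHT.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Under these assumptions, unquantized fp16 weights alone are well over 100 GB, while 8-bit and 4-bit quantization bring the footprint down a lot.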

I don’t know why it says “Running on cpu. Upgrade.”, however.

I have tested llama.cpp inference of Llama 2 7B, 13B, and 70B on different CPUs.

The fastest 70B INT8 speed was 3.77 tokens/s (AMD 9654P, 96 cores, 768 GB memory).

Run command:
build/bin/main -m /home/apps/models/wizardlm-70b-q8_0.bin -gqa 8 -eps 1e-5 -t 96 -n 1024 --repeat_penalty 1.0 --color -c 512 --temp 0.6 -p "Please introduce me something about vipshop holdings ltd."

Token speed:

llama_print_timings: load time = 1576.96 ms
llama_print_timings: sample time = 10.12 ms / 396 runs ( 0.03 ms per token, 39114.97 tokens per second)
llama_print_timings: prompt eval time = 1500.66 ms / 15 tokens ( 100.04 ms per token, 10.00 tokens per second)

llama_print_timings: eval time = 104635.98 ms / 395 runs ( 264.90 ms per token, **3.77 tokens per second**)
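The tokens-per-second figures can be re-derived from the raw times and counts in the log. The following is a minimal sketch, assuming the `llama_print_timings` line format shown above; the regular expression and the `tokens_per_second` helper are illustrative and not part of llama.cpp.

```python
import re

# Two timing lines copied from the log above (bold markers removed).
LOG = """\
llama_print_timings: prompt eval time = 1500.66 ms / 15 tokens ( 100.04 ms per token, 10.00 tokens per second)
llama_print_timings: eval time = 104635.98 ms / 395 runs ( 264.90 ms per token, 3.77 tokens per second)
"""

# Capture the stage name, the elapsed milliseconds, and the token/run count.
PATTERN = re.compile(
    r"llama_print_timings:\s+(prompt eval|eval) time\s*=\s*([\d.]+) ms"
    r"\s*/\s*(\d+) (?:tokens|runs)"
)


def tokens_per_second(log: str) -> dict:
    """Map each stage to count / elapsed seconds (hypothetical helper)."""
    rates = {}
    for stage, ms, count in PATTERN.findall(log):
        rates[stage] = int(count) / (float(ms) / 1000.0)
    return rates


rates = tokens_per_second(LOG)
for stage, rate in rates.items():
    print(f"{stage}: {rate:.2f} tokens/s")
```

The `eval` rate is the generation speed, which is the 3.77 tokens/s quoted above; the `prompt eval` rate covers only the 15 prompt tokens.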

That makes sense :)