I was just using this model here on Hugging Face. I noticed that it referenced a CPU, which I didn’t expect at all. Here’s the link:
Beside the title it says:
“Running on cpu. Upgrade.”
Is this just the endpoint running on a CPU?
No, it’s running with Inference Endpoints, which is probably backed by several powerful GPUs (A100s). Running a 70B model on a CPU would be extremely slow and would take over 100 GB of RAM.
I don’t know why it shows “Running on cpu. Upgrade.”, though.
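To put the “over 100 GB of RAM” figure in perspective, here is a quick back-of-the-envelope sketch (my own illustration, not from the reply above; it counts weights only, so the KV cache and runtime overhead come on top):

```python
# Rough weights-only memory estimate for a 70B-parameter model at
# different precisions. Real usage adds KV cache and runtime overhead.
PARAMS = 70e9  # approximate Llama 2 70B parameter count

bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "q4_0": 0.5}

for dtype, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{dtype}: ~{gib:.0f} GiB of weights")
# fp16 -> ~130 GiB, int8 -> ~65 GiB: anything above 4-bit quantization
# needs far more RAM than a typical consumer CPU box offers.
```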
I have tested llama.cpp inference with Llama 2 7B, 13B, and 70B on different CPUs.
The fastest 70B INT8 speed was 3.77 tokens/s (AMD EPYC 9654P, 96 cores / 768 GB memory).
Run command:
```
build/bin/main -m /home/apps/models/wizardlm-70b-q8_0.bin \
  -gqa 8 -eps 1e-5 -t 96 -n 1024 \
  --repeat_penalty 1.0 --color -c 512 --temp 0.6 \
  -p "Please introduce me something about vipshop holdings ltd."
```
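(For anyone unfamiliar with these options, as I understand the llama.cpp CLI of that generation: `-gqa 8` sets the grouped-query-attention factor the 70B GGML models require, `-eps 1e-5` the RMS-norm epsilon, `-t 96` the thread count, `-n 1024` the number of tokens to generate, `-c 512` the context size, and `--temp 0.6` the sampling temperature. Check `--help` on your build, since these flags have changed over time.)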
Token speed:

```
llama_print_timings:        load time =   1576.96 ms
llama_print_timings:      sample time =     10.12 ms /   396 runs   (    0.03 ms per token, 39114.97 tokens per second)
llama_print_timings: prompt eval time =   1500.66 ms /    15 tokens (  100.04 ms per token,    10.00 tokens per second)
llama_print_timings:        eval time = 104635.98 ms /   395 runs   (  264.90 ms per token,     3.77 tokens per second)
```
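The headline figure is just the reciprocal of the per-token eval latency: 1000 ms ÷ 264.90 ms/token ≈ 3.77 tokens per second.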
It makes sense. :)