I was just using this model here on Hugging Face. I noticed that it referenced a CPU, which I didn’t expect at all. Here’s the link:
Beside the title it says:
“Running on cpu. Upgrade.”
Is this just the endpoint running on a CPU?
No, it’s running with Inference Endpoints, which is probably backed by several powerful GPUs (A100s). Running a 70B model on a CPU would be extremely slow and would take over 100 GB of RAM.
I don’t know why it shows “Running on cpu. Upgrade.”, though.
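To put the “over 100 GB of RAM” figure in perspective, here is a quick back-of-the-envelope sketch (my own illustration, not from the reply above; it counts weights only, so the KV cache and runtime overhead come on top):

```python
# Rough weights-only memory estimate for a 70B-parameter model at
# different precisions. Real usage adds KV cache and runtime overhead.
PARAMS = 70e9  # approximate Llama 2 70B parameter count

bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "q4_0": 0.5}

for dtype, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{dtype}: ~{gib:.0f} GiB of weights")
# fp16 -> ~130 GiB, int8 -> ~65 GiB: anything above 4-bit quantization
# needs far more RAM than a typical consumer CPU box offers.
```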
I have tested llama.cpp inference with Llama 2 7B, 13B, and 70B on different CPUs.
The fastest 70B INT8 speed was 3.77 tokens/s (AMD EPYC 9654P, 96 cores / 768 GB memory).
Run command:
```
build/bin/main -m /home/apps/models/wizardlm-70b-q8_0.bin \
  -gqa 8 -eps 1e-5 -t 96 -n 1024 \
  --repeat_penalty 1.0 --color -c 512 --temp 0.6 \
  -p "Please introduce me something about vipshop holdings ltd."
```
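(For anyone unfamiliar with these options, as I understand the llama.cpp CLI of that generation: `-gqa 8` sets the grouped-query-attention factor the 70B GGML models require, `-eps 1e-5` the RMS-norm epsilon, `-t 96` the thread count, `-n 1024` the number of tokens to generate, `-c 512` the context size, and `--temp 0.6` the sampling temperature. Check `--help` on your build, since these flags have changed over time.)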
Token speed:

```
llama_print_timings:        load time =   1576.96 ms
llama_print_timings:      sample time =     10.12 ms /   396 runs   (    0.03 ms per token, 39114.97 tokens per second)
llama_print_timings: prompt eval time =   1500.66 ms /    15 tokens (  100.04 ms per token,    10.00 tokens per second)
llama_print_timings:        eval time = 104635.98 ms /   395 runs   (  264.90 ms per token,     3.77 tokens per second)
```
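The headline figure is just the reciprocal of the per-token eval latency: 1000 ms ÷ 264.90 ms/token ≈ 3.77 tokens per second.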
It makes sense. :)