I run other stuff like faceswap and Stable Diffusion, but here is what I’m getting with these models in chat or chat-instruct mode. I’m just curious whether anyone is using similar hardware, or getting similar numbers with different hardware.
I’m using sub-$400 used servers off of Amazon to get these results. I have another one coming with more RAM and SSDs, which I’m eager to try.
If anyone has questions or advice, I’d be interested to hear what you have to say.
Models being run on:
VMware Virtual Machine: Ubuntu 22.04.3
CPU Cores 34
Sockets 34
Cores per Socket 1
IOMMU Enabled
Memory 45 GB
Hard disk 1 x 400 GB
Total Cost $355
Dell PowerEdge R630 Server
Processor 2 x Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
2 Sockets, 18 Cores per Socket
Total Cores 36
Logical Processors 72
RAM 63.91 GB
6 x 1.6TB SAS HDD in RAID 10.
OS ESXi version: 8.0.1
llama.cpp
llama2 7b Q4_K_M GGUF — Avg. 4.75 tokens/s
llama2 7b Q5_K_M GGUF — Avg. 4.05 tokens/s
Orca 2 7b Q5_K_M GGUF — Avg. 3.97 tokens/s
Orca 2 13b Q5_K_M GGUF — Avg. 2.27 tokens/s
Platypus yi 34b Q4_K_M GGUF — Avg. 1.06 tokens/s
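For anyone comparing against their own runs, here is a minimal sketch of how I'd average tokens/s across several generations. The helper and the run numbers in the example are hypothetical, not the actual timings behind the figures above — llama.cpp reports a per-run rate, so I just average those across runs:

```python
# Hypothetical helper: average tokens/s over several generation runs.
def avg_tokens_per_sec(runs):
    """runs: list of (tokens_generated, elapsed_seconds) tuples."""
    rates = [tokens / seconds for tokens, seconds in runs]
    return sum(rates) / len(rates)

# Made-up example numbers, roughly in the ballpark of a 7b Q4_K_M run:
runs = [(128, 27.2), (128, 26.5), (256, 54.1)]
print(round(avg_tokens_per_sec(runs), 2))  # → 4.76
```

Averaging the per-run rates (rather than total tokens over total time) keeps a long run from dominating the figure.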