Expected speeds for CPU inference?

Using the model mosaicml/mpt-30b-instruct via Transformers in Python or in Oobabooga, we’re getting a generation speed of about 0.04 tokens per second with a test prompt of simply “hello”.
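For reference, this is roughly the kind of script we’re timing (a minimal sketch; the exact generation settings in our runs may differ):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-30b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # MPT ships custom model code on the Hub
    torch_dtype=torch.float32,
)

inputs = tokenizer("hello", return_tensors="pt")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=32)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/sec")
```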

On an Ubuntu 22.04 server with 8 cores at 3.7 GHz (16 threads) and 128 GB of DDR4 RAM, is this really the best I can expect, or do we have something seriously misconfigured?

We are getting similar speeds with other models. What should we be seeing for text generation speeds?

I believe you can go somewhat faster, but not much, since you are probably running a full 32-bit or 16-bit model, and CPU inference is much slower than GPU to begin with. I would recommend either a smaller MPT model or a quantized version of mpt-30b-instruct, which could be several times faster.

What exactly is a quantized version?
What model could we use that’s much faster?

Our goal is gathering relevant content from web scrapes, if that matters. We’re using GPT-3.5 Turbo 16k, but the speed limits are killing us, so we have to build our own solution.

Any solution that doesn’t require a $30,000 Nvidia GPU would be preferred :slight_smile:

A quantized version is basically a compressed model. The model you are currently using stores its weights as 32-bit values.
A quantized model uses fewer bits per weight: 16, 8, 4, or even 2. The fewer the bits, the faster it runs and the less RAM it needs, though accuracy may drop slightly.
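To put rough numbers on it, the weights alone for a 30B-parameter model work out to roughly this much memory at each precision:

```python
# Approximate weight memory for a 30-billion-parameter model at
# different precisions (weights only; real usage adds overhead).
params = 30e9
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{params * bits / 8 / 1e9:.0f} GB")
# 32-bit: ~120 GB, 16-bit: ~60 GB, 8-bit: ~30 GB, 4-bit: ~15 GB
```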

As a rough rule of thumb, a 16-bit model is about 2x faster than 32-bit, an 8-bit model about 4x faster, and a 4-bit one about 8x faster, since CPU generation speed is mostly limited by how fast the weights can be streamed from RAM.
A quantized version of your model is TheBloke/mpt-30B-instruct-GGML.
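Here’s a minimal sketch of running a GGML MPT model on CPU with the ctransformers library (the quantized file name below is just an assumption, pick an actual one from the repo’s file list):

```python
from ctransformers import AutoModelForCausalLM

# MPT GGML files don't load through plain Transformers; ctransformers
# (pip install ctransformers) can run them on CPU.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/mpt-30B-instruct-GGML",
    model_file="mpt-30b-instruct.ggmlv0.q4_0.bin",  # assumed file name, check the repo
    model_type="mpt",
    threads=8,  # start at your physical core count
)
print(llm("hello", max_new_tokens=64))
```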

Also, there are lots of smaller models that can perform nearly as well, though they may be less accurate for your purpose. Some I tested that worked great: TheBloke/WizardLM-13B-V1.0-Uncensored-GGML and TheBloke/wizard-vicuna-13B-GGML. If you want HF format, TheBloke/Wizard-Vicuna-13B-Uncensored-HF.

It’s worth experimenting with n_threads; I found it can make a significant difference. Start with the number of physical cores (not hyperthreads).
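Rough sketch of where that knob lives, assuming your 8-core box (the parameter name differs per backend):

```python
import torch

# Plain Transformers on CPU: PyTorch's intra-op thread pool.
torch.set_num_threads(8)  # start at the physical core count, then vary

# GGML-based backends take it as a load parameter instead, e.g. with
# llama-cpp-python:
#   llm = Llama(model_path="model.bin", n_threads=8)
```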

Is there an 8 or 16 bit model that can handle 16k context?

Will experiment with n_threads, thanks.

I don’t think MPT-30B has a 16k context, and the vast majority of models only have around a 2k context, but I did find one that can handle 16k: lmsys/longchat-13b-16k. It’s not quantized, but it’s smaller, so it would be much faster. Also, depending on what your content is, I don’t think you need massive LLMs.
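A minimal sketch of loading it with plain Transformers, with the main caveat I know of in the comment:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: longchat-13b-16k is a LLaMA variant with "condensed" rotary
# embeddings. A plain load like this works, but to actually benefit from
# the 16k context you may need FastChat's condense patch or a transformers
# release with RoPE-scaling support -- check the model card.
model_id = "lmsys/longchat-13b-16k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```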

I couldn’t get that model to work in Oobabooga, and I’m not yet knowledgeable enough to figure out what I’m doing wrong just using Transformers in regular Python.

We’ve leased an Nvidia A10 with 16 GB of VRAM. Everything is so much faster with a GPU, but apparently I still can’t use any of the larger models.