Estimating tokens per second

vincentmin · June 19, 2023, 4:00pm

I am comparing HuggingFace inference endpoints with competitors. It would be immensely useful to have an estimate of how many tokens per second we can expect to produce. I understand that this is difficult to estimate, as it depends on a large number of factors, including model size, batch size, chosen instance, etc. Still I would like to get a rough idea of what we can expect and I’m having a hard time finding any reliable sources on this.

Would it be possible to state here (or give a reference for) the expected tokens per second with settings similar to the following?
batch_size = 1
instance = "aws NVIDIA T4 1 14GB"
model = "togethercomputer/RedPajama-INCITE-Instruct-3B-v1"

Feel free to adjust the settings to something more convenient.

philschmid · June 19, 2023, 4:15pm

You should be able to try it yourself to get a rough estimate.
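
For anyone who wants to try this, a minimal sketch of a timing loop could look like the following. It is not an official benchmark: `generate` is a hypothetical callable (an assumption, not part of any Hugging Face library) that sends the prompt to your endpoint and returns the number of tokens it generated, and the `clock` parameter is only there so the timing source can be swapped out.

```python
import time
from typing import Callable


def measure_tokens_per_second(
    generate: Callable[[str], int],
    prompt: str,
    runs: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return generated tokens per second, aggregated over `runs` calls.

    `generate` is a hypothetical callable: it should send `prompt` to the
    endpoint and return how many tokens were generated for that request.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = clock()
        total_tokens += generate(prompt)
        total_seconds += clock() - start

    if total_seconds <= 0:
        raise ValueError("elapsed time was zero; cannot compute a rate")
    return total_tokens / total_seconds
```

Timed this way, the result is an end-to-end figure that includes network latency, so it will generally come out lower than any server-side number for the same endpoint.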

vincentmin · June 19, 2023, 9:10pm

Fair enough, I may do so. Let me still encourage you to put an estimate of tokens per second somewhere in your documentation for the benefit of potential users.

pejrich · June 27, 2023, 7:48am

I can’t give you exact timings for your specific model, but here is a rough estimate. I ran the OpusMT English-to-Spanish translation model on an A10G AWS instance and sent 100 requests through it. The average request was 40 tokens (all were in the range of 25-55 tokens), roughly 175 characters of input per request. The average request time was ~2 seconds and the p95 was ~4 seconds, so about 10-20 tokens per second. FYI, these are the numbers reported in the Hugging Face inference metrics panel, not end-to-end times from the client, so it’s likely just the server-side time being measured.
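
The arithmetic behind that range can be sketched as below. This assumes the 40-token average is a fair stand-in for the number of tokens processed per request, and it pairs that average with the p95 latency, which is only an approximation. The function name is illustrative.

```python
def tokens_per_second(tokens: float, seconds: float) -> float:
    """Convert a token count and a request time into a throughput figure."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


avg_tokens = 40          # average tokens per request, from the post above
avg_seconds = 2.0        # average request time (~2 s)
p95_seconds = 4.0        # p95 request time (~4 s)

mean_rate = tokens_per_second(avg_tokens, avg_seconds)  # 40 / 2 = 20 tokens/s
p95_rate = tokens_per_second(avg_tokens, p95_seconds)   # 40 / 4 = 10 tokens/s

print(f"~{p95_rate:.0f}-{mean_rate:.0f} tokens per second")
```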
