I am comparing HuggingFace Inference Endpoints with competitors. It would be immensely useful to have an estimate of how many tokens per second we can expect to produce. I understand that this is difficult to estimate, as it depends on many factors, including model size, batch size, and the chosen instance. Still, I would like to get a rough idea of what to expect, and I’m having a hard time finding any reliable sources on this.
Would it be possible to state here (or point to a reference for) the expected tokens per second with settings similar to the following?
batch_size = 1
instance = “aws NVIDIA T4 1 14GB”
model = “togethercomputer/RedPajama-INCITE-Instruct-3B-v1”
Feel free to adjust the settings to something more convenient.
You should be able to try it yourself to get a rough estimate.
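For anyone who wants to try this, here is a minimal sketch of a timing harness. It is not tied to any HuggingFace API: `generate` is a hypothetical callable that you would replace with your own call to the endpoint, and it is assumed to return the number of tokens produced. `fake_generate` is only a stand-in so the sketch runs without a deployed endpoint.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def measure_tokens_per_second(
    generate: Callable[[str], int], prompts: List[str]
) -> Dict[str, float]:
    """Call `generate` once per prompt (batch size 1) and report throughput.

    `generate` is a hypothetical stand-in for a request to your endpoint;
    it must return the number of tokens produced for the prompt.
    """
    if not prompts:
        raise ValueError("need at least one prompt")

    latencies = []
    token_counts = []
    for prompt in prompts:
        start = time.perf_counter()
        n_tokens = generate(prompt)
        latencies.append(time.perf_counter() - start)
        token_counts.append(n_tokens)

    return {
        "requests": len(prompts),
        "avg_latency_s": mean(latencies),
        # Total tokens divided by total wall-clock time spent in requests.
        "tokens_per_second": sum(token_counts) / sum(latencies),
    }


def fake_generate(prompt: str) -> int:
    """Placeholder model: sleeps briefly and pretends to emit 20 tokens."""
    time.sleep(0.01)
    return 20


if __name__ == "__main__":
    print(measure_tokens_per_second(fake_generate, ["hello"] * 5))
```

Note that this measures end-to-end time from the client, so network latency is included and the result will usually be lower than a server-side figure.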
Fair enough, I may do so. Let me still encourage you to put an estimate of tokens per second somewhere in your documentation for the benefit of potential users.
I can’t give you exact timings for your specific model, but here is a rough estimate from running the OpusMT English-to-Spanish translation model on the A10G AWS instance. I ran 100 requests through it; the average request size was 40 tokens (all were in the range of 25-55 tokens), or roughly 175 characters of input per request. The average request time was ~2 seconds and the p95 was ~4 seconds. That works out to roughly 10-20 tokens per second (40 tokens in 4 seconds at the p95, 40 tokens in 2 seconds on average). FYI, these are the numbers reported in the HuggingFace inference metrics panel, not end-to-end times from the client, so they likely measure only server-side time.
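The arithmetic behind the 10-20 tokens per second figure can be written out as a short sketch. It assumes the ~40-token average request size is a fair proxy for the tokens handled per request, which is the same simplification the estimate above makes; the function name is only illustrative.

```python
def tokens_per_second(tokens_per_request: float, latency_s: float) -> float:
    """Throughput for a single request: tokens divided by request time."""
    return tokens_per_request / latency_s


AVG_TOKENS = 40  # average request size reported above (assumed proxy)

avg_rate = tokens_per_second(AVG_TOKENS, 2.0)  # average request time ~2 s
p95_rate = tokens_per_second(AVG_TOKENS, 4.0)  # p95 request time ~4 s

print(avg_rate)  # 20.0
print(p95_rate)  # 10.0
```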