How to reduce Inference API costs for long format text generation?

Hi, I’m considering building out some apps using the Inference API, with GPT-Neo and maybe GPT-2. It seems really awesome. Something I realized very quickly, though, is how fast you could rack up a huge bill using the API, especially when generating long-form text.

For example, the “Question Answering” example from https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api is a little over 1,000 characters per request. On the Supporter plan ($25 per million input characters), that works out to about two and a half cents per generation, which for any app would add up SUPER quick. If a user of your app did 10 generations, that's 25 cents; if that user did the same every week, it's a dollar per month, per user. And that's assuming they don't do more than ten generations, since for some use cases ten wouldn't be enough. Using it in any app that doesn't charge users a monthly fee would be impossible. Even if you eventually switched to the Startup plan ($599/mo), it would still be about 1 cent per generation, which would be way too much for most apps; even just having ads on your app wouldn't cover it. All of the Inference API payment plans are far more expensive for long-form generation than GPT-3's API, even the Davinci model.
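To make the arithmetic above concrete, here's a quick back-of-envelope sketch (the rates and usage numbers are just the ones quoted above, not official pricing; check the current pricing page):

```python
def cost_per_generation(chars_per_request: int, price_per_million_chars: float) -> float:
    """Dollar cost of one generation under character-based pricing."""
    return chars_per_request * price_per_million_chars / 1_000_000

# Supporter plan figures from the post: ~1,000 chars/request at $25/M chars.
per_gen = cost_per_generation(1_000, 25)
print(f"${per_gen:.3f} per generation")          # $0.025, i.e. 2.5 cents

# 10 generations per week, ~4 weeks per month:
per_user_monthly = per_gen * 10 * 4
print(f"${per_user_monthly:.2f}/user/month")     # $1.00
```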

I’m not trying to complain here; the Inference API is awesome. I'm just looking for tips on how to reduce input length, or maybe something I'm missing overall. Or maybe the Inference API just isn't right for non-SaaS or low-cost SaaS apps, or perhaps I'm looking at this from the wrong perspective entirely. Anything would help! Thank you.
