How do I batch process 5 million prompts with a Llama 2 model using Inference Endpoints?

I’d like to batch process 5 million prompts using this Llama 2-based model:

If I deploy to Inference Endpoints, each inference call takes around 10-20 seconds. At that rate, processing the prompts one at a time would take roughly 1.5-3 years (5,000,000 prompts × 10-20 s ≈ 50-100 million seconds).

How can I scale inference so all 5 million rows are processed in parallel at a reasonable cost? Am I simply out of luck? For comparison, the same task would cost about $5K with gpt-3.5-turbo. The naive client-side fan-out I'd try first is sketched below.
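As a baseline, here is a minimal sketch of that fan-out: send requests concurrently from a thread pool against the endpoint. The endpoint URL, token, `max_new_tokens`, and worker count are all placeholders, and the payload/response shapes assume the standard text-generation `inputs` format (they may differ depending on the container serving the model):

```python
import concurrent.futures
import requests

# Placeholders: substitute your actual Inference Endpoint URL and HF token.
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
HEADERS = {"Authorization": "Bearer hf_xxx", "Content-Type": "application/json"}

def generate(prompt: str) -> str:
    """Send one prompt to the endpoint and return the generated text."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256}}
    resp = requests.post(ENDPOINT_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    # Text-generation endpoints typically return [{"generated_text": ...}];
    # adjust if your container responds with a different shape.
    return resp.json()[0]["generated_text"]

prompts = [f"Prompt {i}" for i in range(100)]  # stand-in for the 5M rows

# Fan requests out; concurrency only helps if the endpoint autoscales
# or batches requests server-side (e.g., TGI continuous batching).
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(generate, prompts))
```

Even with this, a single replica still serves roughly one request every 10-20 seconds, so client-side concurrency alone doesn't change the math; the real question is how many replicas or GPUs I'd need and whether that's affordable.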

What do you recommend?
