I’d like to batch-process 5 million prompts using this Llama-2-based model:
If I deploy to Inference Endpoints, each inference call takes around 10–20 seconds. Processed one prompt at a time, that means my model would take roughly 1.5–3 years to get through every prompt.
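For reference, here is the back-of-envelope arithmetic behind that estimate (using the 10–20 s latency figures above and assuming strictly sequential calls):

```python
# Sequential-processing estimate: 5 million prompts at 10-20 s each.
n_prompts = 5_000_000
seconds_per_year = 365 * 24 * 3600

for latency_s in (10, 20):
    total_s = n_prompts * latency_s
    years = total_s / seconds_per_year
    print(f"{latency_s} s/call -> {years:.1f} years")
# 10 s/call -> 1.6 years
# 20 s/call -> 3.2 years
```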
How can I scale the inference to process 5 million rows in parallel for a reasonable cost? Am I simply out of luck? For comparison, the cost of running my task through GPT-3.5 Turbo would be about $5K.
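To make the cost question concrete, here is a rough sizing sketch: how many parallel replicas would be needed to finish in a given wall-clock window, and what that might cost. The per-prompt latency, deadline, and hourly GPU price below are all illustrative assumptions, not measured numbers:

```python
# Hypothetical sizing sketch for parallel inference replicas.
# All three inputs are assumptions for illustration only.
import math

n_prompts = 5_000_000
latency_s = 15          # assumed per-prompt latency (midpoint of 10-20 s)
target_hours = 48       # assumed deadline
gpu_hourly_usd = 1.5    # assumed hourly price per replica

prompts_per_worker = target_hours * 3600 / latency_s
workers = math.ceil(n_prompts / prompts_per_worker)
cost_usd = workers * target_hours * gpu_hourly_usd

print(f"{workers} replicas, ~${cost_usd:,.0f}")
# 435 replicas, ~$31,320
```

Note this assumes one prompt per replica at a time; server-side batching (e.g. continuous batching in a serving framework) can raise per-replica throughput considerably and shrink both numbers.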
What do you recommend?