How do I batch process 5 million prompts with a Llama 2 model using Inference Endpoints?

I’d like to batch process 5 million prompts using this Llama 2-based model:

If I deploy to Inference Endpoints, each inference call takes around 10-20 seconds. At that rate, processing the prompts one at a time would take roughly 1.5-3 years (5,000,000 prompts × 10-20 s ≈ 50-100 million seconds).

How can I scale inference so all 5 million rows are processed in parallel at a reasonable cost? Am I simply out of luck? For comparison, the same task would cost about $5K with gpt-3.5-turbo. The naive client-side fan-out I'd try first is sketched below.
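As a baseline, here is a minimal sketch of that fan-out: send requests concurrently from a thread pool against the endpoint. The endpoint URL, token, `max_new_tokens`, and worker count are all placeholders, and the payload/response shapes assume the standard text-generation `inputs` format (they may differ depending on the container serving the model):

```python
import concurrent.futures
import requests

# Placeholders: substitute your actual Inference Endpoint URL and HF token.
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
HEADERS = {"Authorization": "Bearer hf_xxx", "Content-Type": "application/json"}

def generate(prompt: str) -> str:
    """Send one prompt to the endpoint and return the generated text."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256}}
    resp = requests.post(ENDPOINT_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    # Text-generation endpoints typically return [{"generated_text": ...}];
    # adjust if your container responds with a different shape.
    return resp.json()[0]["generated_text"]

prompts = [f"Prompt {i}" for i in range(100)]  # stand-in for the 5M rows

# Fan requests out; concurrency only helps if the endpoint autoscales
# or batches requests server-side (e.g., TGI continuous batching).
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(generate, prompts))
```

Even with this, a single replica still serves roughly one request every 10-20 seconds, so client-side concurrency alone doesn't change the math; the real question is how many replicas or GPUs I'd need and whether that's affordable.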

What do you recommend?
