Does anyone have experience running Longformer for inference at scale (millions of docs)?
I’m interested in:
- Which GPU architecture + inference library combination would maximize batched throughput per node (docs/sec)?
- Once GPU cloud costs are factored in, would the most cost-efficient setup (docs per dollar) differ from the highest-throughput one?
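
To make the second question concrete, here's a toy calculation (all numbers made up, purely illustrative) showing how the throughput-optimal and cost-optimal choices can diverge:

```python
# Hypothetical GPU setups with made-up throughput and pricing figures,
# just to show that docs/sec and docs/$ can pick different winners.
setups = {
    "big_gpu":     {"docs_per_hour": 120_000, "usd_per_hour": 3.00},
    "small_gpu_x4": {"docs_per_hour": 80_000, "usd_per_hour": 1.40},
}

# Highest raw throughput per node
best_throughput = max(setups, key=lambda k: setups[k]["docs_per_hour"])

# Most documents processed per dollar spent
best_cost = max(
    setups, key=lambda k: setups[k]["docs_per_hour"] / setups[k]["usd_per_hour"]
)

print(best_throughput)  # big_gpu
print(best_cost)        # small_gpu_x4
```

So in principle yes, the two optima can differ; I'm curious whether people see this in practice with Longformer specifically.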