Does anyone have experience running Longformer for inference at scale (millions of docs)?
I’m interested in:
- Which GPU architecture + inference library combination would maximize batched throughput per node (docs/sec)?
- Once GPU cloud costs are factored in, would the most cost-efficient setup (docs per dollar) differ from the highest-throughput one?
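
To make the second question concrete, here's a toy calculation (all numbers made up, purely illustrative) showing how the throughput-optimal and cost-optimal choices can diverge:

```python
# Hypothetical GPU setups with made-up throughput and pricing figures,
# just to show that docs/sec and docs/$ can pick different winners.
setups = {
    "big_gpu":     {"docs_per_hour": 120_000, "usd_per_hour": 3.00},
    "small_gpu_x4": {"docs_per_hour": 80_000, "usd_per_hour": 1.40},
}

# Highest raw throughput per node
best_throughput = max(setups, key=lambda k: setups[k]["docs_per_hour"])

# Most documents processed per dollar spent
best_cost = max(
    setups, key=lambda k: setups[k]["docs_per_hour"] / setups[k]["usd_per_hour"]
)

print(best_throughput)  # big_gpu
print(best_cost)        # small_gpu_x4
```

So in principle yes, the two optima can differ; I'm curious whether people see this in practice with Longformer specifically.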