Scaling batch inference for a Longformer model

Does anyone have experience running Longformer for inference at scale (millions of docs)?

I’m interested in:

  • What GPU architecture + software library would maximize throughput, measured in batches processed per node?
  • If GPU cloud costs are taken into account, would the setup that maximizes cost efficiency differ from the one that maximizes raw throughput?
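For context, this is roughly the kind of batched inference loop I'd be scaling up (a minimal sketch using Hugging Face Transformers and PyTorch; the checkpoint, batch size, max length, and fp16 choice are just placeholders, not a tuned configuration):

```python
# Minimal batched-inference sketch for Longformer on a single GPU.
# Placeholder values throughout -- the question is what to change to scale this to millions of docs.
import torch
from torch.utils.data import DataLoader
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification

device = torch.device("cuda")
checkpoint = "allenai/longformer-base-4096"  # placeholder checkpoint

tokenizer = LongformerTokenizerFast.from_pretrained(checkpoint)
model = LongformerForSequenceClassification.from_pretrained(checkpoint)
model.to(device).eval().half()  # fp16 to raise throughput; assumes the GPU supports it

docs = ["long document text ..."] * 64  # placeholder corpus of raw document strings

loader = DataLoader(docs, batch_size=8)  # placeholder batch size
all_logits = []
with torch.inference_mode():
    for batch in loader:
        # Tokenize each batch on the fly; pad/truncate to Longformer's 4096-token window.
        enc = tokenizer(list(batch), padding=True, truncation=True,
                        max_length=4096, return_tensors="pt").to(device)
        logits = model(**enc).logits
        all_logits.append(logits.float().cpu())
```

I'm mainly wondering whether the answer is "bigger GPUs and bigger batches" or whether a different serving stack changes the throughput-per-dollar picture entirely.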