How to optimise transformer inference speed for batches of inputs?

Hi all! I have DistilBert fine-tuned for a sequence classification task, and I'm struggling to classify large batches of inputs efficiently with the fine-tuned model.

The tokeniser runs very quickly (~2k it/s), but actually applying the model to the tokenised input is very slow (~15 it/s).
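For context, here's roughly what my inference loop looks like — a minimal sketch, assuming a stock DistilBert sequence-classification checkpoint (the model name, input texts, and batch size below are placeholders, not my actual setup):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder checkpoint; in practice this is my fine-tuned model directory.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
model.eval()  # disable dropout for inference

texts = ["an example sentence to classify"] * 64  # placeholder inputs
batch_size = 32

preds = []
with torch.no_grad():  # skip gradient bookkeeping during inference
    for i in range(0, len(texts), batch_size):
        # Tokenise one batch at a time; padding=True pads to the longest
        # sequence in the batch rather than a fixed max length.
        batch = tokenizer(
            texts[i : i + batch_size],
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(device)
        logits = model(**batch).logits
        preds.extend(logits.argmax(dim=-1).tolist())
```

Is this the right general shape, or am I missing something obvious (e.g. larger batches, fp16, ONNX export)?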

Does anyone know of resources or best practices for optimising inference performance?