Best practices for improving text generation speed?

Aside from buying a faster/larger GPU, are there any best practices for increasing text generation speed? Specifically, I’d like to use GPT-2 (of various sizes) to generate a large amount of text (5,000 examples, 1,000 BPE tokens each).

My initial research shows there are a few options:
(a) DeepSpeed for inference
(b) batched generation
(c) fp16 inference, although I’m not sure how to do this outside of `Trainer`. I could call `model.half()`, but it’s not clear to me whether that’s the right way to go about it.

For reference, here are rough sketches of what I have in mind for each of these.
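For (b) and (c) together, my understanding is that for inference-only work I can cast the weights once with `model.half()` and pad a batch of prompts on the left (GPT-2 has no pad token, so I’d reuse EOS). A minimal sketch of what I’m considering; the model size, batch of prompts, and sampling settings are placeholders, not a tested recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-large"  # placeholder; any GPT-2 size should work
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 has no pad token, so reuse EOS, and pad on the left so each
# sequence is generated from the true end of its prompt
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# (c) fp16 inference: cast the weights once, then generate as usual
model = AutoModelForCausalLM.from_pretrained(model_name).half().to(device)
model.eval()

# (b) batched generation: tokenize several prompts together
prompts = ["In a shocking finding,", "The meaning of life is"]  # placeholders
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,                     # input_ids + attention_mask
        do_sample=True,
        max_new_tokens=1000,          # ~1,000 BPE tokens per example
        pad_token_id=tokenizer.eos_token_id,
    )
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

I’d then loop over the 5,000 prompts in chunks of whatever batch size fits in memory.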
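For (a), my reading of the DeepSpeed inference docs is that the engine wraps an already-loaded HF model and injects fused CUDA kernels into the transformer layers. A sketch assuming a single GPU (the `mp_size` / `replace_with_kernel_inject` arguments are my interpretation of the docs, not something I’ve run):

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2-large")  # placeholder size

# Wrap the model in DeepSpeed's inference engine; kernel injection
# swaps the transformer layers for fused inference kernels
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,                       # single GPU, no model parallelism
    dtype=torch.half,                # fp16 kernels
    replace_with_kernel_inject=True,
)
model = ds_engine.module  # should behave like the original HF model,
                          # i.e. model.generate(...) works as before
```

Is that roughly the right pattern, and is it worth combining with (b)/(c) above?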

Any advice is appreciated!
