Aside from buying a faster/larger GPU, are there any best practices for increasing text generation speed? Specifically, I’d like to use GPT-2 (of various sizes) to generate a large set of text (5,000 examples, 1,000 BPE tokens each).
My initial research suggests a few options:
(a) DeepSpeed for inference
(b) batched generation
(c) fp16 inference, though I’m not sure how to do this outside of Trainer. I could call model.half(), but it’s not clear to me whether that’s the right way to go about it. I’ve put rough sketches of what I mean for (a) and (b)/(c) below.
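
For (a), my understanding is that DeepSpeed exposes an init_inference() entry point that wraps the model and injects fused inference kernels. Something like the sketch below is what I have in mind; the exact kwargs (mp_size, replace_with_kernel_inject) are taken from the docs and may differ across DeepSpeed versions, so treat this as a rough outline rather than a tested recipe:

```python
import torch
import deepspeed
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Wrap the model with DeepSpeed's inference engine. replace_with_kernel_inject
# swaps supported modules for fused CUDA kernels; mp_size=1 means no model
# parallelism (single GPU).
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.half,
    replace_with_kernel_inject=True,
)
model = ds_engine.module  # the wrapped model should still expose .generate()
```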
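For (b) and (c) together, here’s roughly what I’m picturing: left-pad a batch of prompts (GPT-2 has no pad token, so I’d reuse EOS) and call model.half() before generate(). The prompts and batch size below are placeholders, and I’m assuming model.half() is safe for inference-only use:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; reuse EOS
tokenizer.padding_side = "left"            # left-pad so generation continues from real tokens

model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model.half()  # fp16 weights; assuming this suffices for inference-only use
model.eval()

prompts = ["Once upon a time"] * 8  # placeholder prompts and batch size

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=1000,  # note GPT-2's 1024-token context limits prompt + output
        pad_token_id=tokenizer.eos_token_id,
    )

texts = tokenizer.batch_decode(out, skip_special_tokens=True)
```

In particular, is model.half() the right move here, or would torch.cuda.amp.autocast be the better route for fp16 generation?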
Any advice is appreciated!