Does using FP16 help accelerate generation? (HuggingFace BART)

I also asked this question on StackOverflow, but haven’t gotten a response yet (pytorch - Does using FP16 help accelerate generation? (HuggingFace BART) - Stack Overflow).

I followed the guide below to use FP16 in PyTorch.

Basically, I’m using BART from HuggingFace for generation.

  1. During the training phase, I’m able to get a 2x speedup and lower GPU memory consumption.
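For context, the training side uses the standard torch.cuda.amp recipe (autocast plus GradScaler). A simplified sketch, with a placeholder checkpoint name and a toy batch instead of my real data:

import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scaler = GradScaler()

# Toy batch, just to show the pattern
src = tokenizer(["Some source text."], return_tensors="pt").to("cuda")
labels = tokenizer(["Some target text."], return_tensors="pt").input_ids.cuda()

optimizer.zero_grad()
with autocast():                      # forward pass runs in mixed precision
    loss = model(**src, labels=labels).loss
scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()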

But:

  1. I found out there is no speedup when I call model.generate under torch.cuda.amp.autocast():
with torch.cuda.amp.autocast():
   model.generate(...)
  2. When I save the model with:
model.save_pretrained("model_folder")

the size does not decrease by half; I have to call model.half() before saving in order to make the saved model half the size (both steps are sketched below).
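To be concrete, the two steps above look roughly like this (the checkpoint name and input text are just placeholders):

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().eval()
inputs = tokenizer("Some input text.", return_tensors="pt").to("cuda")

# 1. Generation under autocast: I see no speedup over plain fp32
with torch.cuda.amp.autocast():
    model.generate(**inputs, max_length=60)

# 2. Saving: the checkpoint only shrinks to half size if the weights are cast first
model.half()
model.save_pretrained("model_folder")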

Thus, my questions:

  • Is the behavior in 1. expected, or did I do something wrong?
  • Is the operation I did in 2. correct?

To use the model for inference in fp16, you should call model.half() after loading it.
Note that calling half() puts all model weights in fp16, whereas in mixed precision training some parts are still kept in fp32 for stability (like softmax layers), so it might be a better idea to use amp at the O1 opt level instead of calling half().
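For example (an untested sketch; the checkpoint name is a placeholder), the two options would look roughly like this:

from transformers import BartForConditionalGeneration

# Option 1: pure fp16 inference, casting every weight to half precision
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().eval()
model.half()

# Option 2: apex amp at opt level O1, which patches ops to run in fp16 where safe
# while keeping numerically sensitive parts in fp32 (no optimizer is needed for inference)
from apex import amp
model_o1 = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().eval()
model_o1 = amp.initialize(model_o1, opt_level="O1")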

I’m not sure whether torch.cuda.amp.autocast() helps with inference; it would be better to ask about this on the PyTorch forum.

I actually tried using apex in O1 mode, but it does not help generation speed either. It did, however, improve the training speed of the seq2seq BART model. Is that the case for you as well?
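In case it helps to compare, here is a rough timing sketch (placeholder checkpoint and input; numbers will vary with GPU, batch size, and generation length):

import time
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().eval()
inputs = tokenizer("Some input text to generate from.", return_tensors="pt").to("cuda")

def timed_generate(label):
    torch.cuda.synchronize()           # make sure the timing covers all queued GPU work
    start = time.time()
    with torch.no_grad():
        model.generate(**inputs, max_length=60)
    torch.cuda.synchronize()
    print(f"{label}: {time.time() - start:.3f}s")

timed_generate("fp32")

with torch.cuda.amp.autocast():        # autocast applies to the ops run inside the context
    timed_generate("autocast")

model.half()                           # pure fp16 weights
timed_generate("fp16 (half)")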