I also asked this question on StackOverflow, but haven’t gotten a response yet (pytorch - Does using FP16 help accelerate generation? (HuggingFace BART) - Stack Overflow).
I followed the guide below to use FP16 in PyTorch.
Basically, I’m using BART in HuggingFace for generation.
- During the training phase, I’m able to get a 2x speedup and lower GPU memory consumption (training loop sketch below).
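For context, my training loop follows the standard mixed-precision recipe from that guide (a minimal sketch, assuming `model`, `optimizer`, and `dataloader` are already set up):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        outputs = model(**batch)
        loss = outputs.loss
    scaler.scale(loss).backward()      # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)             # unscale gradients, then take the optimizer step
    scaler.update()
```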
But:
1. I found out there is no speedup when I call `model.generate` under `torch.cuda.amp.autocast()`:

```python
with torch.cuda.amp.autocast():
    model.generate(...)
```
2. When I save the model with:

```python
model.save_pretrained("model_folder")
```

the size does not decrease to half. I have to call `model.half()` before saving to make the saved model half the size.
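Concretely, this is the only way I found to get a half-size checkpoint (a minimal sketch; "model_folder" is just a placeholder path):

```python
model.half()                           # cast all weights to FP16 in place
model.save_pretrained("model_folder")  # the checkpoint on disk is now about half the size
```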
Thus, my questions:

- Is the issue in 1. expected, or is there something I did wrong?
- Is the operation I did in 2. proper?
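For completeness, a minimal script along these lines reproduces what I see when timing generation (my sketch; it assumes the `facebook/bart-base` checkpoint and a CUDA GPU):

```python
import time
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-base"  # assumption: any BART checkpoint behaves the same
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name).cuda().eval()

inputs = tokenizer(
    ["My friends are cool but they eat too many carbs."] * 8,
    return_tensors="pt",
    padding=True,
).to("cuda")

def timed_generate(use_amp):
    torch.cuda.synchronize()
    start = time.time()
    with torch.cuda.amp.autocast(enabled=use_amp):
        model.generate(**inputs, max_length=50)
    torch.cuda.synchronize()
    return time.time() - start

print("FP32 generate:", timed_generate(False))  # baseline
print("AMP generate: ", timed_generate(True))   # roughly the same time on my end
```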