I also asked this question on StackOverflow, but haven't gotten a response yet (https://stackoverflow.com/questions/64102036/does-using-fp16-help-accelerate-generation-huggingface-bart).
I followed the guide below to use FP16 in PyTorch.
Basically, I'm using BART in HuggingFace for generation:
- During the training phase, I'm able to get a 2x speedup and lower GPU memory consumption.
- However, I found there is no speedup when I call generation under autocast:

  ```
  with torch.cuda.amp.autocast():
      model.generate(...)
  ```

- When I save the model, the checkpoint size does not decrease to half; I have to call model.half() before saving in order to get a half-size model.
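For context on the generation observation: decoding is autoregressive, so each token is a separate small forward pass, and autocast does not reduce the per-step overhead (Python loop, kernel launches). A minimal toy sketch of that pattern, where a plain `nn.Linear` stands in for the model (an assumption for illustration, not BART itself):

```python
import torch
import torch.nn as nn

# Toy stand-in for the model (assumption: illustrates the loop shape only).
layer = nn.Linear(64, 64)
device = "cuda" if torch.cuda.is_available() else "cpu"
layer = layer.to(device)
x = torch.randn(1, 64, device=device)

# Under autocast the matmuls run in reduced precision, but each decoding
# step is still a tiny, launch-bound op, so wall-clock gains can vanish.
dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=dtype):
    for _ in range(5):  # sequential steps, like token-by-token decoding
        x = layer(x)
print(tuple(x.shape))
```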
Thus, my questions:

- Is the lack of speedup during generation under autocast expected, or is there something I did wrong?
- Is calling model.half() before saving a proper way to save the model in half precision?
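On the saving question, the behavior I see can be reproduced with a small stand-in module (assumption: a plain `nn.Linear` here, but the same applies to BART's state_dict). `torch.save` stores tensors in their current dtype, so training with autocast leaves the FP32 master weights untouched; only casting with `model.half()` halves the checkpoint:

```python
import os
import tempfile
import torch
import torch.nn as nn

# Small stand-in for the model (assumption: same principle as BART's weights).
model = nn.Linear(1024, 1024)

def checkpoint_bytes(m: nn.Module) -> int:
    """Save the state_dict to a temp file and return its size in bytes."""
    fd, path = tempfile.mkstemp(suffix=".pt")
    os.close(fd)
    try:
        torch.save(m.state_dict(), path)
        return os.path.getsize(path)
    finally:
        os.remove(path)

fp32_size = checkpoint_bytes(model)         # parameters stored as float32
fp16_size = checkpoint_bytes(model.half())  # model.half() casts params to float16
print(fp32_size, fp16_size)  # FP16 checkpoint is roughly half the FP32 one
```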