Does using FP16 help accelerate generation? (HuggingFace BART)

I also asked this question on StackOverflow, but haven’t gotten a response yet (pytorch - Does using FP16 help accelerate generation? (HuggingFace BART) - Stack Overflow).

I followed the guide below to use FP16 in PyTorch.

Basically, I’m using BART from HuggingFace for generation.

  1. During the training phase, I’m able to get a 2x speedup and lower GPU memory consumption.
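For context, the training side uses the standard torch.cuda.amp recipe (autocast plus GradScaler). A simplified sketch, with a placeholder checkpoint name and a toy batch instead of my real data:

import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scaler = GradScaler()

# Toy batch, just to show the pattern
src = tokenizer(["Some source text."], return_tensors="pt").to("cuda")
labels = tokenizer(["Some target text."], return_tensors="pt").input_ids.cuda()

optimizer.zero_grad()
with autocast():                      # forward pass runs in mixed precision
    loss = model(**src, labels=labels).loss
scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()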

But:

  1. I found out there is no speedup when I call model.generate under torch.cuda.amp.autocast():
with torch.cuda.amp.autocast():
   model.generate(...)
  2. When I save the model with:
model.save_pretrained("model_folder")

the size does not decrease by half; I have to call model.half() before saving in order to make the saved model half the size (both steps are sketched below).
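To be concrete, the two steps above look roughly like this (the checkpoint name and input text are just placeholders):

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().eval()
inputs = tokenizer("Some input text.", return_tensors="pt").to("cuda")

# 1. Generation under autocast: I see no speedup over plain fp32
with torch.cuda.amp.autocast():
    model.generate(**inputs, max_length=60)

# 2. Saving: the checkpoint only shrinks to half size if the weights are cast first
model.half()
model.save_pretrained("model_folder")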

Thus, my questions:

  • Is the behavior in 1. expected, or did I do something wrong?
  • Is the operation I did in 2. correct?

To use the model for inference in fp16, you should call model.half() after loading it.
Note that calling half() puts all model weights in fp16, whereas in mixed precision training some parts are still kept in fp32 for stability (like softmax layers), so it might be a better idea to use amp at the O1 opt level instead of calling half().
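For example (an untested sketch; the checkpoint name is a placeholder), the two options would look roughly like this:

from transformers import BartForConditionalGeneration

# Option 1: pure fp16 inference, casting every weight to half precision
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().eval()
model.half()

# Option 2: apex amp at opt level O1, which patches ops to run in fp16 where safe
# while keeping numerically sensitive parts in fp32 (no optimizer is needed for inference)
from apex import amp
model_o1 = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().eval()
model_o1 = amp.initialize(model_o1, opt_level="O1")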

I’m not sure whether torch.cuda.amp.autocast() helps with inference; it would be better to ask about this on the PyTorch forum.

I actually tried using apex in O1 mode, but it does not help generation speed either. It did, however, improve the training speed of the seq2seq BART model. Is that the case for you as well?
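In case it helps to compare, here is a rough timing sketch (placeholder checkpoint and input; numbers will vary with GPU, batch size, and generation length):

import time
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda().eval()
inputs = tokenizer("Some input text to generate from.", return_tensors="pt").to("cuda")

def timed_generate(label):
    torch.cuda.synchronize()           # make sure the timing covers all queued GPU work
    start = time.time()
    with torch.no_grad():
        model.generate(**inputs, max_length=60)
    torch.cuda.synchronize()
    print(f"{label}: {time.time() - start:.3f}s")

timed_generate("fp32")

with torch.cuda.amp.autocast():        # autocast applies to the ops run inside the context
    timed_generate("autocast")

model.half()                           # pure fp16 weights
timed_generate("fp16 (half)")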