I was able to finetune GPT2 355M at a 2048-token sequence length without FP16; everything fit in VRAM.
But no luck with GPT2 774M. It obviously didn't fit in VRAM, so I used FP16 and DeepSpeed CPU offload. That freed up about 9 GB of VRAM, but then I ran out of system RAM.
Has anyone succeeded in training GPT2 774M with 16 GB of VRAM?
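For reference, this is roughly the kind of setup I mean — a minimal sketch, not my exact script (model name, batch size, and the ZeRO options shown are illustrative):

```python
# Sketch: GPT2-large finetuning with FP16 + DeepSpeed ZeRO-2 optimizer CPU offload.
# The offload trades VRAM for system RAM, which is where my RAM runs out.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # Move optimizer states (Adam moments) to CPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    # "auto" lets the HF Trainer fill these in from TrainingArguments.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

model = AutoModelForCausalLM.from_pretrained("gpt2-large")  # 774M params
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # illustrative: smallest possible micro-batch
    gradient_accumulation_steps=16,  # illustrative: recover effective batch size
    gradient_checkpointing=True,     # trades compute for activation memory
    fp16=True,
    deepspeed=ds_config,             # a dict is accepted in place of a JSON path
)
# trainer = Trainer(model=model, args=args, train_dataset=...)  # dataset elided
```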