Is there a way to fine-tune GPT-2 775M on 16GB VRAM and 24GB RAM?

I was able to fine-tune GPT-2 355M with a 2048-token sequence length, without FP16, and everything fit in VRAM.
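
Roughly the kind of setup I mean: a plain Hugging Face Trainer run without FP16 (the batch size, accumulation steps, and dataset below are placeholders, not my exact values):

```python
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

model = GPT2LMHeadModel.from_pretrained("gpt2-medium")       # the 355M checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")

args = TrainingArguments(
    output_dir="gpt2-medium-finetuned",
    per_device_train_batch_size=1,       # placeholder: whatever fits at 2048 tokens
    gradient_accumulation_steps=8,       # placeholder
    num_train_epochs=1,
    fp16=False,                          # full precision, as in the 355M run
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,         # assumed: already tokenized into 2048-token blocks
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```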

But no luck with GPT-2 775M. It obviously didn't fit in VRAM, so I used FP16 and DeepSpeed CPU offload. That way I had 9GB of VRAM free, but I ran out of RAM.
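
The DeepSpeed side is basically ZeRO stage 2 with the optimizer states offloaded to CPU RAM, plus FP16; something like the sketch below (the config values are illustrative, not my exact file):

```python
from transformers import TrainingArguments

# ZeRO stage 2 with optimizer states pushed to CPU RAM; this is what ends up
# consuming system memory instead of VRAM.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="gpt2-large-finetuned",   # "gpt2-large" is the ~775M-parameter checkpoint
    per_device_train_batch_size=1,       # placeholder
    fp16=True,
    deepspeed=ds_config,                 # the Trainer also accepts a path to a JSON config file
)
```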

Has anyone succeeded in training GPT-2 775M with 16GB VRAM?

Now I'm able to run training with a block size of 1568.
Free VRAM: 1593 MiB, free RAM: 14 GB.
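
For reference, by block size I mean the fixed chunk length the tokenized text is cut into for causal LM training, as in the grouping step of Hugging Face's run_clm.py (a rough sketch, assuming that script's preprocessing):

```python
block_size = 1568  # currently fits; 2048 is the goal

def group_texts(examples):
    # Concatenate every tokenized field, then slice into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM the labels are just the inputs; the shift happens inside the model.
    result["labels"] = result["input_ids"].copy()
    return result

# Applied with datasets.Dataset.map(group_texts, batched=True) on the tokenized text.
```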

It feels like there is room to run block size 2048…