Train Large Dataset On The Cloud

Can someone help me? I am trying to fine-tune GPT-2 on a RunPod instance. Is there a better way to get my datasets loaded onto my RunPod so I am not wasting time and money on the cloud? I was hoping I could load, tokenize, and generate the splits all at home, then save the result, to save a large amount of time.

Right now I do the following:

  1. Load Model and tokenizer
  2. Load Dataset and tokenize it
  3. Train model
  4. Save model