Training GPT-2 with OSCAR Dataset in Dutch: Seeking Advice

Hello Hugging Face Community,

I’m about to train a GPT-2 model from scratch and have chosen the Dutch portion of the OSCAR dataset for this. At roughly 48 GB, it is comparable in size to WebText.
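
For reference, this is roughly how I load the data (a minimal sketch; I’m assuming the unshuffled_deduplicated_nl config here, so check the dataset card for the exact config name in the OSCAR version you use):

```python
from datasets import load_dataset

# Dutch portion of OSCAR from the Hub.
# "unshuffled_deduplicated_nl" is my assumption for the config name;
# the exact identifier depends on the OSCAR release you pick.
oscar_nl = load_dataset("oscar", "unshuffled_deduplicated_nl", split="train")

print(oscar_nl)  # expecting ~25M rows with an 'id' and a 'text' column
```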

My main question is how many epochs the model needs to converge effectively. Given the significant cost of training (I’m using 16 V100-16GB GPUs :sweat_smile:), I’m keen to hear insights or experiences from the community on optimizing this training process.

Current Training Setup

  • Batch Size: 8
  • Gradient Accumulation: 8
  • Number of GPUs: 16
  • Effective Batch Size: 1024 (quick check below)
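
To make the batch arithmetic explicit, here is where the 1024 comes from (micro-batch per GPU × gradient accumulation steps × number of GPUs):

```python
# Effective batch size per optimizer step
per_gpu_batch_size = 8
grad_accum_steps   = 8
num_gpus           = 16

effective_batch_size = per_gpu_batch_size * grad_accum_steps * num_gpus
print(effective_batch_size)  # 1024 samples per optimizer step
```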

Training Calculations

  • Dataset Rows: 25,113,265
    • Results in ≈ 196,197 batches per GPU per epoch (25,113,265 / 8 / 16)
    • Equates to ≈ 24,524 optimizer steps per epoch (196,197 / 8)
    • Approximately 45 hours per epoch on my current setup (rough arithmetic below).
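
The same numbers as a small script, with the measured 45 hours converted into an implied time per optimizer step (the last figure is only a rough estimate derived from my own timing):

```python
dataset_rows       = 25_113_265
per_gpu_batch_size = 8
num_gpus           = 16
grad_accum_steps   = 8

batches_per_gpu  = dataset_rows // (per_gpu_batch_size * num_gpus)  # 196,197
optimizer_steps  = batches_per_gpu // grad_accum_steps              # 24,524
seconds_per_step = 45 * 3600 / optimizer_steps                      # ~6.6 s per optimizer step

print(batches_per_gpu, optimizer_steps, round(seconds_per_step, 1))
```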

Optimization Strategy

  • Using CosineAnnealingLR in combination with AdamW(lr=1e-4); a rough sketch of the setup is below.
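
For concreteness, here is a rough sketch of what I have in mind (the epoch count and eta_min below are placeholders I picked myself, since the number of epochs is exactly what I’m unsure about):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

steps_per_epoch = 24_524   # from the calculation above
num_epochs      = 5        # placeholder -- this is the open question
total_steps     = num_epochs * steps_per_epoch

model     = GPT2LMHeadModel(GPT2Config())  # GPT-2 small, randomly initialized
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=1e-6  # eta_min is a guess on my part
)

# In the training loop: call optimizer.step() and scheduler.step()
# once every 8 accumulated micro-batches.
```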

I would greatly appreciate any thoughts, comments, or suggestions on best practices. Have any of you tackled similar challenges, and if so, how did you optimize your training process?

Thank you in advance for your insights and advice!