Hi everyone!
I’m fine-tuning a GPT-2 model on a dataset with a size of 500,000 steps. What proportion of the training steps would be logical as the start-up point to consider for warm-up steps?
Hi everyone!
I’m fine-tuning a GPT-2 model on a dataset with a size of 500,000 steps. What proportion of the training steps would be logical as the start-up point to consider for warm-up steps?