Hello,
For anyone interested in the answer: this is the expected training speed for the given hardware and model size. What I ended up doing to improve the speed substantially was:
- Lower the context length from 2048 to 512.
- Use mixed precision training.
- Use a quantized optimizer.
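The last two points can be sketched roughly like this in PyTorch. This is a minimal illustration, not the actual training code from the post: the tiny linear model and the hyperparameters are stand-ins, and the quantized optimizer (e.g. `bitsandbytes`' `AdamW8bit`) is only named in a comment so the sketch runs without extra dependencies.

```python
import torch

# Stand-in for the real language model (illustrative only).
model = torch.nn.Linear(16, 16)

# Quantized optimizer: with bitsandbytes installed this line would be
#   optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
# Plain AdamW is used here so the sketch has no extra dependencies.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(4, 16)

# Mixed precision: run the forward pass in bfloat16 via autocast.
# On GPU this would be device_type="cuda" (plus a GradScaler if you
# use float16 instead of bfloat16).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()
optimizer.step()
optimizer.zero_grad()
```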
The first step (lowering the context length) had the biggest impact on training speed. I trained with the shortened context window on ~95% of my data, then increased it back to 2048 for the remaining 5%.
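The two-phase schedule above can be expressed as a small helper. The function name and parameters are illustrative, not from the original post; integer arithmetic is used for the cutoff so the 95% boundary is exact.

```python
def context_length_for_step(step, total_steps,
                            short_len=512, full_len=2048,
                            short_pct=95):
    """Return the context length for a training step: the short length
    for the first short_pct percent of steps, the full length after."""
    cutoff = total_steps * short_pct // 100  # integer cutoff, no float error
    return short_len if step < cutoff else full_len
```

For example, over 1000 training steps, steps 0-949 would use a 512-token context and steps 950-999 the full 2048 tokens.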