Thanks, I updated the volume size and added checkpointing. It seems the job still fails before completing the first epoch, though. My training data consists of 1.7M short text descriptions (~100 MB) across 23 classes. Would a distributed approach help here? Like in this post
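To clarify what I mean by checkpointing: something roughly like this simplified, framework-free sketch (placeholder names, JSON state instead of real model weights), saving every N steps instead of per epoch so a failure before epoch 1 completes doesn't lose all progress:

```python
import json
import os

def save_checkpoint(state, path):
    """Atomically write training state so a crash mid-save can't corrupt it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on POSIX

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)

def train(num_steps, ckpt_path, save_every=100):
    # Resume from the last saved step if a checkpoint exists
    state = load_checkpoint(ckpt_path) or {"step": 0}
    for step in range(state["step"], num_steps):
        # ... one optimizer step on a minibatch would go here ...
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, ckpt_path)
    save_checkpoint(state, ckpt_path)  # final save
    return state["step"]
```

On restart, `train` picks up from the last saved step rather than from step 0, so repeated failures mid-epoch still make forward progress.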