Caching issues with MarianMT

I’m trying to use examples/pytorch/translation/run_translation_no_trainer.py (from tag v4.46.0) and accelerate to finetune a MarianMT model with ~23M lines of bitext from Opus, and I’ve noticed a couple of things that I’d like to fix. Before training on all 23M lines, I ran some initial tests with 2M and 8M lines, both of which completed without issue. However, once I started using all 23M lines I began getting NCCL timeout errors while the dataset was being cached. I raised the timeout limit with:

from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=18000))
accelerator = Accelerator(kwargs_handlers=[kwargs])

which seems to do the trick, but is there a way to speed up the caching itself rather than just letting it run longer?
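For what it’s worth, one thing I’ve been wondering about is whether I can parallelize the tokenization and make sure only rank 0 actually builds the cache while the other ranks just read it back. A rough sketch of what I mean, building on the accelerator above (num_proc, raw_datasets, and preprocess_function are placeholders for whatever the script actually uses):

with accelerator.main_process_first():
    # Only the main process runs the map and writes the cache;
    # the other ranks wait here and then load the cached Arrow files.
    tokenized = raw_datasets.map(
        preprocess_function,   # the script's tokenization function
        batched=True,
        num_proc=8,            # parallel workers; tune for the machine
        remove_columns=raw_datasets["train"].column_names,
        desc="Tokenizing",
    )

Would bumping num_proc like this be the right lever, or is there something better?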

The second issue I hit is that the cached Arrow files don’t seem to be reused when I restart training; instead they get regenerated every time. I saw Datasets' cache not re-used · Issue #3847 · huggingface/datasets · GitHub and the datasets docs on The cache, but it wasn’t clear to me why the cache keeps getting regenerated. Is it a bug, or is it something I’m totally missing?
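In case it helps clarify what I’m after, the workaround I’ve been considering is to sidestep fingerprinting entirely: tokenize once, save the processed dataset to disk explicitly, and reload it on every restart. A minimal sketch, assuming the usual datasets API (the path is just a placeholder):

import os
from datasets import load_from_disk

PROCESSED_DIR = "/data/opus_tokenized"  # placeholder path

if os.path.isdir(PROCESSED_DIR):
    # Reuse the previously processed dataset on restart.
    tokenized = load_from_disk(PROCESSED_DIR)
else:
    tokenized = raw_datasets.map(preprocess_function, batched=True)
    tokenized.save_to_disk(PROCESSED_DIR)

But I’d rather understand why the automatic cache isn’t being picked up in the first place.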

Any info would be much appreciated!
