Caching issues with MarianMT

I’m trying to use examples/pytorch/translation/run_translation_no_trainer.py (from tag v4.46.0) and accelerate to finetune a MarianMT model with ~23M lines of bitext from Opus, and I’ve noticed a couple of things that I’d like to fix. Before training on all 23M lines, I ran some initial tests with 2M and 8M lines, both of which completed without issue. However, once I started using all 23M lines I began getting NCCL timeout errors while the dataset was being cached. I raised the timeout limit with:

from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=18000))
accelerator = Accelerator(kwargs_handlers=[kwargs])

which seems to do the trick, but is there a way to speed up the caching itself rather than just letting it run longer?
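For what it’s worth, one thing I’ve been wondering about is whether I can parallelize the tokenization and make sure only rank 0 actually builds the cache while the other ranks just read it back. A rough sketch of what I mean, building on the accelerator above (num_proc, raw_datasets, and preprocess_function are placeholders for whatever the script actually uses):

with accelerator.main_process_first():
    # Only the main process runs the map and writes the cache;
    # the other ranks wait here and then load the cached Arrow files.
    tokenized = raw_datasets.map(
        preprocess_function,   # the script's tokenization function
        batched=True,
        num_proc=8,            # parallel workers; tune for the machine
        remove_columns=raw_datasets["train"].column_names,
        desc="Tokenizing",
    )

Would bumping num_proc like this be the right lever, or is there something better?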

The second issue I hit is that the cached Arrow files don’t seem to be reused when I restart training; instead they get regenerated every time. I saw Datasets' cache not re-used · Issue #3847 · huggingface/datasets · GitHub and the datasets docs on The cache, but it wasn’t clear to me why the cache keeps getting regenerated. Is it a bug, or is it something I’m totally missing?
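In case it helps clarify what I’m after, the workaround I’ve been considering is to sidestep fingerprinting entirely: tokenize once, save the processed dataset to disk explicitly, and reload it on every restart. A minimal sketch, assuming the usual datasets API (the path is just a placeholder):

import os
from datasets import load_from_disk

PROCESSED_DIR = "/data/opus_tokenized"  # placeholder path

if os.path.isdir(PROCESSED_DIR):
    # Reuse the previously processed dataset on restart.
    tokenized = load_from_disk(PROCESSED_DIR)
else:
    tokenized = raw_datasets.map(preprocess_function, batched=True)
    tokenized.save_to_disk(PROCESSED_DIR)

But I’d rather understand why the automatic cache isn’t being picked up in the first place.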

Any info would be much appreciated!
