Infrastructure for pretraining and finetuning via accelerate

Hi, we are using the accelerate example scripts for pretraining and fine-tuning, and we would like to know whether there are any known DevOps-style tools that are compatible with them, especially in the context of multi-GPU or multi-host distributed training.

We currently use custom versions of three of the example scripts: run_clm_no_trainer.py and run_mlm_no_trainer.py from ./examples/pytorch/language-modeling, along with run_glue_no_trainer.py from ./examples/pytorch/text-classification.

Of course, compatibility with the scripts themselves is less critical than the output files they produce (e.g., optimizer.bin and random_states_*.pkl in addition to pytorch_model.bin).

As an aside, note that we experimented with an AWS blog post on using Hugging Face with SageMaker, but that would break the bank in a heartbeat; it was geared toward reproducing web-scale pretraining quickly. Instead, we have a moderate number of domain-specific documents to process (e.g., fewer than 1 billion tokens).

Best,
Tom