Infrastructure for pretraining and finetuning via accelerate

Hi, we are using the accelerate example scripts for pretraining and fine-tuning, and we would like to know whether there are any known DevOps-style tools that are compatible with them, especially in the context of multi-GPU or multi-host distributed training.

We currently use custom versions of three of the example scripts: run_clm_no_trainer.py and run_mlm_no_trainer.py from ./examples/pytorch/language-modeling, along with run_glue_no_trainer.py from ./examples/pytorch/text-classification.

Of course, compatibility with the scripts themselves is less critical than the output files they produce (e.g., optimizer.bin and random_states_*.pkl in addition to pytorch_model.bin).

As an aside, note that we experimented with an AWS blog post on using Hugging Face with SageMaker, but that would break the bank in a heartbeat; it was geared toward reproducing web-scale pretraining quickly. Instead, we have a moderate number of domain-specific documents to process (e.g., fewer than 1 billion tokens).

Best,
Tom