Trainer and Accelerate

What are the differences between them? If the Trainer can already handle multi-GPU training, why do we need Accelerate?

Is Accelerate only for custom training code (where you want to add or remove something yourself)?

1 Like

I assume accelerate was added later and has more features like:

"""
Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just
four lines of code!

tl;dr: it handles everything from CPU to GPU(s) to multi-node to TPU, plus DeepSpeed and mixed precision, in one simple wrapper, without the complicated calls that e.g. DDP requires for multi-GPU.

ref: my notes: https://www.evernote.com/shard/s410/sh/f1158fa5-4122-0d17-d6eb-a920461e12b6/g47Qtu6j1F58zvMnJ3fWY8v6pFFWi3I_krn5155UigRUmBzr-D8td5HaQA
"""

related: Trainers.train() with accelerate

The Trainer now uses accelerate as the backbone for it (our work over the last few months), so it’s really a question of “do you want raw accelerate, or the Trainer API?”. The capabilities are the same overall :slight_smile:
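For illustration only (this toy example is mine, not part of the thread; the checkpoint name and dummy dataset are arbitrary), the Trainer-API side of that choice looks roughly like this, with device placement, mixed precision, and multi-GPU handled by the accelerate backend rather than by your own loop:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# tiny in-memory dataset just to make the sketch runnable
texts = ["great movie", "terrible movie"] * 8
labels = [1, 0] * 8
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toy-output", num_train_epochs=1, per_device_train_batch_size=4),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()

The raw-accelerate equivalent of a loop like this is roughly the diff shown a few posts down.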

3 Likes

Just saw this. Pasting this as a ref: Hugging Face Trainer? · Issue #144 · huggingface/accelerate · GitHub

Short answer:

Since the Trainer already creates an accelerator object inside its own code, you don’t have to make any code changes, apart from writing your own accelerate config and launching with:

accelerate launch --config_file {path/to/config/my_config_file.yaml} {script_name.py} {--arg1} {--arg2} ...

An example config is given at the end.


Long answer

My assumption was that there would be code changes, since every other accelerate tutorial shows them, e.g.:

+ from accelerate import Accelerator
  from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler
  from tqdm.auto import tqdm

+ accelerator = Accelerator()

  model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
  optimizer = AdamW(model.parameters(), lr=3e-5)

- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
- model.to(device)

+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
+     train_dataloader, eval_dataloader, model, optimizer
+ )

  num_epochs = 3
  num_training_steps = num_epochs * len(train_dataloader)
  lr_scheduler = get_scheduler(
      "linear",
      optimizer=optimizer,
      num_warmup_steps=0,
      num_training_steps=num_training_steps
  )

  progress_bar = tqdm(range(num_training_steps))

  model.train()
  for epoch in range(num_epochs):
      for batch in train_dataloader:
-         batch = {k: v.to(device) for k, v in batch.items()}
          outputs = model(**batch)
          loss = outputs.loss
-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()
          lr_scheduler.step()
          optimizer.zero_grad()
          progress_bar.update(1)

but those code changes are already inside the Trainer. The integration is so seamless that it’s easy to miss, or perhaps it’s just not shown in the tutorials, so you have to look at the Trainer source code, e.g.:

if is_accelerate_available():
    from accelerate import __version__ as accelerate_version

    if version.parse(accelerate_version) >= version.parse("0.16"):
        from accelerate import skip_first_batches

    from accelerate import Accelerator
    from accelerate.utils import ...

So just make an accelerate config and run it e.g.,

# -----> see this ref: https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config
# ref for fsdp to know how to change fsdp opts: https://huggingface.co/docs/accelerate/usage_guides/fsdp
# ref for accelerate to know how to change accelerate opts: https://huggingface.co/docs/accelerate/basic_tutorials/launch
# ref alpaca accelerate config: https://github.com/tatsu-lab/alpaca_farm/tree/main/examples/accelerate_configs

main_training_function: main  # <- change

deepspeed_config: { }
distributed_type: FSDP
downcast_bf16: 'no'
dynamo_backend: 'NO'
# seems alpaca was based on: https://huggingface.co/docs/accelerate/usage_guides/fsdp
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  #  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer  # <-change
  fsdp_transformer_layer_cls_to_wrap: FalconDecoderLayer  # <-change
#  fsdp_min_num_params:  7e9 # e.g., suggested heuristic: num_params / num_gpus = params/gpu, multiply by precision in bytes to know GBs used
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
megatron_lm_config: { }
#mixed_precision: 'bf16'
#mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
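Assuming the config above is saved as, say, fsdp_config.yaml and your training script is train.py (both names are hypothetical here), the launch from the short answer becomes:

accelerate launch --config_file fsdp_config.yaml train.py --arg1 --arg2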
1 Like

Hi, can someone help me out?

I am working in Google Colab to learn how to fine-tune a ViT model, following this tutorial: Vision Transformers (ViT) Explained + Fine-tuning in Python - YouTube.

I’m trying to define the following:

  • training and testing dataset
  • feature extractor
  • model
  • collate function
  • evaluation metric

But I get an error when defining TrainingArguments and Trainer, where it says:
ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U

Where do I need to add accelerate? And how do I incorporate it with this code?

Have you solved this? I’m running into the same problem:

After installing, you need to restart the runtime so it loads the newer version.

Thanks, I changed the runtime type to GPU without changing any code, and now it works.
It still didn’t work on CPU, though.
Do you know why?

On the CPU runtime, do pip install accelerate -U, make sure it shows the latest version, then do Runtime → Restart Runtime and run the code again (skipping the install). Let me know if this still errors out.
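For example, in a Colab cell (the version pin is taken from the error message above):

!pip install -U "accelerate>=0.20.1"

Then Runtime → Restart Runtime, and after the restart you can double-check before re-running the rest of the notebook:

import accelerate
print(accelerate.__version__)  # should now be >= 0.20.1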

:hugs: Thanks, it works on CPU as you said.

1 Like

@muellerzr If I want to use only FSDP, do I need HF accelerate? How would I run my script?