AttributeError: 'AcceleratorState' object has no attribute 'distributed_type', Llama 2 70B Fine-tuning, using 'accelerate' on a single GPU

I’m trying to fine-tune the 70B Llama 2 model using the llama-recipes/examples/quickstart.ipynb notebook on my single 4090 GPU server with 24GB VRAM (rented online from HostKey for this purpose).

The quickstart.ipynb file says “This notebook shows how to train a Llama 2 model on a single GPU (e.g. A10 with 24GB) using int8 quantization and LoRA”. However, I could only train the 7B and 13B models with it. (For training the 13B model, I even had to change 'per_device_train_batch_size': 1.)

However, for training the 70B, I keep running into “Out Of Memory Error: CUDA out of memory”. Therefore, I decided to use ‘accelerate’.
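For a rough sense of why the 70B model does not fit, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: it assumes nominal parameter counts (7B, 13B, 70B) and ignores activations, LoRA parameters, optimizer state, and CUDA overhead, so real usage will be higher.

```python
# Rough weight-only memory estimate; ignores activations, optimizer state, and overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}

VRAM_GB = 24  # single 4090


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts (assumed round numbers, not exact model sizes)
for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(n, dtype)
        verdict = "fits" if gb < VRAM_GB else "does not fit"
        print(f"{name} {dtype}: ~{gb:.0f} GB of weights ({verdict} in {VRAM_GB} GB)")
```

By this estimate the 70B weights alone need about 70 GB in int8, which is well beyond 24GB, while the 13B model in int8 leaves only limited headroom for everything else.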

For example, I updated the first code cell to use ‘accelerate’ as shown below. It no longer gives the Out Of Memory Error, presumably because device_map="auto" together with an offload folder lets ‘accelerate’ place parts of the model on the CPU or on disk instead of the GPU.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from accelerate import Accelerator


# Path to the local pretrained model directory
model_id = "../Llama-2-70b-hf"

# Initialize Accelerator
accelerator = Accelerator()

# Initialize tokenizer from pretrained model
tokenizer = LlamaTokenizer.from_pretrained(model_id)

# Initialize an empty skeleton of the model
with init_empty_weights():
    model = LlamaForCausalLM.from_pretrained(model_id)

# Specify the folder to offload model parts to disk
offload_folder = "../offload_folder"

# Load checkpoint and dispatch
checkpoint_file = "../Llama-2-70b-hf/pytorch_model.bin.index.json"
model = load_checkpoint_and_dispatch(
    model, 
    checkpoint=checkpoint_file, 
    device_map="auto",
    offload_folder=offload_folder  # Include the offload_folder here
)

# Use lower precision 
model = model.to(dtype=torch.float16)

model, tokenizer = accelerator.prepare(model, tokenizer)

However, later on, I run into a problem in the Fine-tune cell block, which I think is because I’m using the latest version of ‘accelerate’ (not the old accelerate 0.15.0 & transformers 4.28.1). When I run the code below, I get: AttributeError: 'AcceleratorState' object has no attribute 'distributed_type'

from transformers import default_data_collator, Trainer, TrainingArguments

# Define training args
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    bf16=True,  # Use BF16 if available
    push_to_hub=False,  # make sure to include this for Accelerate compatibility
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
    optim="adamw_torch_fused",
    max_steps=total_steps if enable_profiler else -1,
    **{k:v for k,v in config.items() if k != 'lora_config'}
)

# Remember to prepare your data_collator using accelerator if you've defined any
data_collator = default_data_collator
data_collator = accelerator.prepare(data_collator)

# Also, modify the Trainer call to include the accelerator's device
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
 #   eval_dataset=eval_dataset,
    tokenizer=tokenizer,
 #   devices=accelerator.device_count,
    callbacks=[profiler_callback] if enable_profiler else None, # Include the profiler callback if profiling is enabled
 #   compute_metrics=compute_metrics, # Include this if you have defined a metrics function
)

train_results = trainer.train()

Here is the full error message I get for the above code cell:

AttributeError                            Traceback (most recent call last)
Cell In[31], line 21
      4 training_args = TrainingArguments(
      5     output_dir=output_dir,
      6     overwrite_output_dir=True,
   (...)
     16     **{k:v for k,v in config.items() if k != 'lora_config'}
     17 )
     19 # Remember to prepare your data_collator using accelerator if you've defined any
     20 #data_collator = default_data_collator
---> 21 data_collator = accelerator.prepare(data_collator)
     23 # Also, modify the Trainer call to include the accelerator's device
     24 trainer = Trainer(
     25     model=model,
     26     args=training_args,
   (...)
     33  #   compute_metrics=compute_metrics, # Include this if you have defined a metrics function
     34 )

File ~/Desktop/llama-2_for_70B/llama/70B_env/lib/python3.10/site-packages/accelerate/accelerator.py:1142, in Accelerator.prepare(self, device_placement, *args)
   1137 elif len(device_placement) != len(args):
   1138     raise ValueError(
   1139         f"`device_placement` should be a list with {len(args)} elements (the number of objects passed)."
   1140     )
-> 1142 if self.distributed_type == DistributedType.FSDP:
   1143     model_count = 0
   1144     optimizer_present = False

File ~/Desktop/llama-2_for_70B/llama/70B_env/lib/python3.10/site-packages/accelerate/accelerator.py:468, in Accelerator.distributed_type(self)
    466 @property
    467 def distributed_type(self):
--> 468     return self.state.distributed_type

AttributeError: 'AcceleratorState' object has no attribute 'distributed_type'
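One guess that I have not confirmed: the state object behind `accelerator` may be shared between all instances, and something that runs after `Accelerator()` is created (possibly `TrainingArguments`) may clear that shared state. The toy below is not accelerate’s code; `ToyState` and `ToyAccelerator` are hypothetical stand-ins that only show how such a pattern could produce this kind of AttributeError.

```python
# Toy illustration only: these classes are hypothetical stand-ins, not accelerate's code.
class ToyState:
    _shared = {}  # one dict shared by every instance

    def __init__(self, distributed_type="NO"):
        # Every instance uses the same dict as its attribute storage
        self.__dict__ = self._shared
        if "distributed_type" not in self._shared:
            self.distributed_type = distributed_type

    @classmethod
    def reset(cls):
        # Clears the attributes of all existing instances at once
        cls._shared.clear()


class ToyAccelerator:
    def __init__(self):
        self.state = ToyState()

    @property
    def distributed_type(self):
        return self.state.distributed_type


acc = ToyAccelerator()
print(acc.distributed_type)  # the attribute is present right after construction

ToyState.reset()  # something else clears the shared state later
try:
    acc.distributed_type
except AttributeError as err:
    print("AttributeError:", err)
```

If that is what is happening, an `accelerator` created before the training arguments would be left pointing at an emptied state, but I would appreciate confirmation from someone who knows the library internals.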

I was also getting the Out Of Memory Error with the prepare_model_for_int8_training function from peft, and I couldn’t get ‘accelerate’ to work either. What could be the solution? Can I comment out the prepare_model_for_int8_training call below, and would the fine-tuning still work?

model = prepare_model_for_int8_training(model)

I don’t care how long it takes; I can live with the fine-tuning/training being slow, as long as it works.
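In case the exact versions matter, this is a small snippet for printing what is installed in the environment. It uses only the standard library; the package list is just the set I assume is relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version string, or None if not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed list of packages relevant to this notebook
print(installed_versions(["accelerate", "transformers", "peft", "torch"]))
```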