Does accelerate.prepare() destroy model weights even if --model_name_or_path is specified and model is loaded?

I am running into a confusing message printed by HuggingFace Accelerate while fine-tuning models with the Megatron-LM plugin. With a Megatron-LM model config file defined, I use the following accelerate command-line options:

--model_name_or_path "cerebras/Cerebras-GPT-13B" \    # Download from HuggingFace Hub
--tokenizer_name "gpt2"  \                            # Download from HuggingFace Hub
--dataset_name "wikitext" \
--dataset_config_name "wikitext-2-raw-v1" \
--block_size 1024 \
--learning_rate 5e-5 \
--lr_scheduler_type "cosine_with_restarts" \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--num_train_epochs 4 \
--output_dir "output_gpt_megatron" \
--report_to "tensorboard" \
--seed 8306

and the following code to load the model:

model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    config=config,
)

I am getting the following messages while fine-tuning with HF Hub datasets (or even local data files):

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [01:12<00:00, 36.31s/it]
All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at cerebras/Cerebras-GPT-13B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
Generation config file not found, using a generation config created from the model config.

...

 > padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
Building gpt model in the pre-training mode.
The Megatron LM model weights are initialized at random in `accelerator.prepare`. Please use `accelerator.load_checkpoint` to load a pre-trained checkpoint matching the distributed setup.

...

The last message in the output above comes from https://github.com/huggingface/accelerate/blob/60856787acfb26365bb06139c21c4742af74158b/src/accelerate/utils/megatron_lm.py, which reads:

def model_provider_func(pre_process=True, post_process=True, add_encoder=True, add_decoder=True):
    """Build the model."""
    args = get_args()
    mode = "pre-training" if args.pretraining_flag else "fine-tuning"
    if args.rank == 0:
        print(f"Building {args.model_type_name} model in the {mode} mode.")
        print(
            "The Megatron LM model weights are initialized at random in `accelerator.prepare`. "
            "Please use `accelerator.load_checkpoint` to load a pre-trained checkpoint matching the distributed setup."
        )
    if args.model_type_name == "bert":
        if args.pretraining_flag:
            num_tokentypes = 2 if args.bert_binary_head else 0
            model = BertModel(...)
        else:
            model = Classification(...)
    elif args.model_type_name == "gpt":
        model = GPTModel(...)
    elif args.model_type_name == "t5":
        model = T5Model(...)
    else:
        raise ValueError(f"Unsupported model type: {args.model_type_name}")
    return model

My question is this:

Does accelerator.prepare(...) destroy the model weights that are loaded using

model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    config=config,
)

and reinitialize the model weights with random ones?

In my understanding, accelerator.prepare(...) wraps the model in the appropriate plugin architecture and/or redistributes the model weights and layers across the available compute devices; a rough sketch of what I mean is shown below.
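
For concreteness, here is a minimal sketch of the flow I have in mind. The plugin settings are illustrative placeholders (my real values come from accelerate config), so treat this as an assumption about the API rather than my actual script:

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin
from transformers import AutoConfig, AutoModelForCausalLM

# Illustrative parallelism settings, not my real configuration.
megatron_plugin = MegatronLMPlugin(tp_degree=1, pp_degree=1, num_micro_batches=1)
accelerator = Accelerator(megatron_lm_plugin=megatron_plugin)

config = AutoConfig.from_pretrained("cerebras/Cerebras-GPT-13B")
model = AutoModelForCausalLM.from_pretrained("cerebras/Cerebras-GPT-13B", config=config)

# My assumption: this call only wraps/shards the weights loaded above,
# rather than rebuilding the model with freshly initialized weights.
model = accelerator.prepare(model)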

Clarity on this issue is much appreciated.

Note that in the Accelerate Megatron-LM documentation, point 4 of the caveats section mentions that β€œIn accelerator.prepare call, a Megatron-LM model corresponding to a given Transformers model is created with random weights. Please use accelerator.load_state to load the Megatron-LM checkpoint with matching TP, PP and DP partitions.”
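
If I read that caveat correctly, the intended pattern would be something like the sketch below, where the checkpoint path is a placeholder for a Megatron-LM checkpoint saved/converted with the same TP, PP and DP partitioning as the current run:

# The Megatron-LM model is (re)built inside prepare(), apparently with random weights.
model, optimizer, lr_scheduler, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, lr_scheduler, train_dataloader, eval_dataloader
)

# The real weights would then have to come from a Megatron-LM checkpoint whose
# TP/PP/DP partitioning matches the current distributed setup.
accelerator.load_state("path/to/megatron_lm_checkpoint")  # placeholder path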

Does this mean that we need to convert the PyTorch model for cerebras/Cerebras-GPT-13B to Megatron-LM format using the interoperability script?

Answered here: Does accelerate.prepare() destroy model weights even if --model_name_or_path is specified and model is loaded? Β· Issue #1610 Β· huggingface/accelerate (https://github.com/huggingface/accelerate/issues/1610)