How to reinitialize GPT-2 XL from scratch in Hugging Face (HF)?

I’m trying to confirm that my GPT-2 model is being trained from scratch, rather than using any pre-existing pre-trained weights. Here’s my approach:

  1. Load the pre-trained GPT-2 XL model: I load it with AutoModelForCausalLM.from_pretrained("gpt2-xl") and compute the total L2 norm of its weights.
  2. Initialize a new GPT-2 model from scratch: I then initialize a new GPT-2 model from scratch with a custom configuration using GPT2Config.
  3. Compare L2 norms: I calculate the L2 norm of the weights for both the pre-trained model and the freshly initialized model. My assumption is that the L2 norm of the scratch model should be much smaller than that of the pre-trained model if the scratch model is truly initialized with random weights.

Here’s the code snippet:

import torch
from transformers import GPT2LMHeadModel, GPT2Config, AutoModelForCausalLM

# Step 1: Load the pre-trained GPT-2 XL model
pretrained_model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

# Step 2: Calculate the L2 norm of the weights for the pre-trained model
pretrained_weight_norm = 0.0
for param in pretrained_model.parameters():
    pretrained_weight_norm += torch.norm(param, p=2).item()

print(f"Total L2 norm of pre-trained model weights: {pretrained_weight_norm:.2f}")

# Step 3: Initialize a new GPT-2 model from scratch with custom configuration
config = GPT2Config(
    vocab_size=52000,  # Ensure this matches the tokenizer's vocabulary size
    n_ctx=1024,  # Context window size (number of tokens the model can see at once)
    bos_token_id=0,  # Begin-of-sequence token
    eos_token_id=1,  # End-of-sequence token
)
model = GPT2LMHeadModel(config)

# Step 4: Calculate the L2 norm of the weights for the freshly initialized model
scratch_weight_norm = 0.0
for param in model.parameters():
    scratch_weight_norm += torch.norm(param, p=2).item()

print(f"Total L2 norm of model initialized from scratch: {scratch_weight_norm:.2f}")

Is this method a valid way to confirm that the model is being trained from scratch? Are there any potential issues or better ways to verify that the model has no pre-existing learned weights?

Looks right: running your snippet, the scratch model's total norm comes out more than an order of magnitude smaller than the pre-trained GPT-2 XL's:

~/beyond-scale-language-data-diversity$ /opt/conda/envs/beyond_scale_div_coeff/bin/python /home/ubuntu/beyond-scale-language-data-diversity/playground/test_gpt2_pt_vs_reinit_scratch.py
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 689/689 [00:00<00:00, 8.05MB/s]
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████| 6.43G/6.43G [00:29<00:00, 221MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 1.03MB/s]
Total L2 norm of pre-trained model weights: 24542.74
Total L2 norm of model initialized from scratch: 1637.31
(beyond_scale_div_coeff)                                                        
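If you also want the scratch model to share GPT-2 XL's architecture (a GPT2Config built with only vocab_size, n_ctx and token IDs otherwise falls back to the 12-layer, 768-dim GPT-2 small architecture), a minimal sketch is to build the model from the gpt2-xl config alone. The sketch below also computes the true global L2 norm (square root of the sum of squared parameter norms) rather than a sum of per-parameter norms, and adds a direct tensor comparison; global_l2_norm is just an illustrative helper name, everything else is standard transformers/PyTorch API.

import torch
from transformers import AutoConfig, AutoModelForCausalLM

def global_l2_norm(model: torch.nn.Module) -> float:
    # True global L2 norm: sqrt of the sum of squared per-parameter norms.
    # (Summing per-parameter norms, as above, gives a different number,
    # though it still separates pre-trained from randomly initialized weights.)
    with torch.no_grad():
        total_sq = sum(param.float().norm(p=2).item() ** 2 for param in model.parameters())
    return total_sq ** 0.5

# Load only the GPT-2 XL *configuration* (architecture hyperparameters) and
# build a model from it: same shapes as gpt2-xl, but randomly initialized
# weights -- no pre-trained parameters are copied in.
config = AutoConfig.from_pretrained("gpt2-xl")
scratch_xl = AutoModelForCausalLM.from_config(config)

pretrained_xl = AutoModelForCausalLM.from_pretrained("gpt2-xl")

# Same architecture, so identical parameter counts -- only the values differ.
assert sum(p.numel() for p in scratch_xl.parameters()) == \
       sum(p.numel() for p in pretrained_xl.parameters())

print(f"Pre-trained GPT-2 XL global L2 norm: {global_l2_norm(pretrained_xl):.2f}")
print(f"Scratch GPT-2 XL global L2 norm:     {global_l2_norm(scratch_xl):.2f}")

# A more direct check: the token-embedding weights should not match.
print("Embeddings identical to pre-trained?",
      torch.allclose(scratch_xl.transformer.wte.weight, pretrained_xl.transformer.wte.weight))

With the architectures matched this way, the norm gap (plus the parameter-count and tensor checks) is a like-for-like signal that the scratch model carries no pre-trained weights.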

cross: python - How to reinitialize from scratch GPT2 XL in HuggingFace? - Stack Overflow

ref: training gpt2 xl from stratch? · Issue #18 · alycialee/beyond-scale-language-data-diversity · GitHub