I want to pre-train a decoder-only (causal LM) model with fewer than 7B parameters, since models at 7B and above tend to be unstable during training, and I want to guarantee, to the best of my ability, that pre-training goes smoothly with minimal babysitting.
Given how nice the pre-training curves for LLaMA 2 (llama2) are, I will try that architecture.
What I need is:
- be able to initialize a LLaMA 2 architecture with fewer parameters (e.g., by decreasing the width or the number of layers; a rough parameter-count sketch follows below), and
- then randomly initialize its weights.
How do I do the above?
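For reference, a back-of-the-envelope parameter count helps pick the width and depth before touching any code. The sketch below is my own rough approximation (it assumes LLaMA-style blocks with a SwiGLU MLP of width about 8/3 of the hidden size and untied input/output embeddings, and it ignores norms; approx_params is just a hypothetical helper name):
# rough estimate of a LLaMA-style model's parameter count (my approximation)
def approx_params(hidden_size, num_layers, vocab_size=32000, intermediate_size=None):
    if intermediate_size is None:
        intermediate_size = int(8 * hidden_size / 3)  # LLaMA-style SwiGLU MLP width
    attn = 4 * hidden_size * hidden_size              # q, k, v, o projections
    mlp = 3 * hidden_size * intermediate_size         # gate, up, down projections
    emb = 2 * vocab_size * hidden_size                # input embeddings + lm_head (untied)
    return num_layers * (attn + mlp) + emb

print(f'{approx_params(2048, 24) / 1e9:.2f}B')  # roughly 1.3B, comfortably under 7B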
Some initial code:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
# compute capability >= 8 (Ampere or newer) ==> bfloat16 is available; otherwise fall back to fp32
bf16 = torch.cuda.get_device_capability(torch.cuda.current_device())[0] >= 8
torch_dtype = torch.bfloat16 if bf16 else torch.float32
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path,
    # quantization_config=quantization_config,
    # device_map=device_map,  # device_map = None https://github.com/huggingface/trl/blob/01c4a35928f41ba25b1d0032a085519b8065c843/examples/scripts/sft_trainer.py#L82
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    use_auth_token=True,
)
print(f'{pretrained_model_name_or_path=}')
# the tokenizer is loaded from the hub id (not from the model object) and does not take a torch_dtype
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, use_auth_token=True)
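Before shrinking anything, I find it useful to print the loaded config to see which fields actually control the size; a small sketch (the values in the comments are what I expect from the published Llama-2-7b config):
# which config fields control width/depth (expected values for Llama-2-7b in comments)
cfg = model.config
print(cfg.hidden_size)          # 4096   (width)
print(cfg.num_hidden_layers)    # 32     (depth)
print(cfg.num_attention_heads)  # 32
print(cfg.intermediate_size)    # 11008  (MLP width)
print(cfg.vocab_size)           # 32000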
and
from transformers import AutoModelForCausalLM, LlamaConfig

# a smaller LLaMA-style config; pass the fields as constructor kwargs
# (class-level attributes on a subclass get overwritten by LlamaConfig.__init__ defaults)
config = LlamaConfig(
    hidden_size=2048,
    num_hidden_layers=24,
)
# from_config builds the model with randomly initialized weights,
# so no extra random_init flag is needed (and load_in is not a real argument)
model = AutoModelForCausalLM.from_config(config)
print(model)
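If the random initialization should be reproducible across runs, it is probably worth seeding before calling from_config; a minimal sketch using transformers.set_seed (the seed value 42 is arbitrary):
from transformers import set_seed

set_seed(42)  # seeds python, numpy, and torch so the random init is reproducible
model = AutoModelForCausalLM.from_config(config)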
Maybe this?
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch
# Load the configuration of the LLaMA v2 model
config = AutoConfig.from_pretrained('meta-llama/Llama-2-7b-hf')
# Modify the configuration to reduce the model size
# LlamaConfig uses hidden_size / num_hidden_layers / num_attention_heads,
# not the GPT-2-style n_embd / n_layer / n_head names
config.hidden_size = 768          # Decrease for a smaller width
config.num_hidden_layers = 8      # Decrease for fewer layers
config.num_attention_heads = 12   # Must divide hidden_size (768 / 12 = 64 per head)
config.num_key_value_heads = 12   # Keep equal to num_attention_heads (plain multi-head attention)
config.intermediate_size = 2048   # Shrink the MLP width too (roughly 8/3 * hidden_size)
# Initialize a model with the modified configuration (weights are random, nothing is loaded)
model = AutoModelForCausalLM.from_config(config)
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
# If you want to use a specific dtype
torch_dtype = torch.float32
model.to(dtype=torch_dtype)
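As a sanity check (my own addition, not from any of the references below), I would count the parameters to confirm the shrunken model is far below 7B and run one dummy forward pass:
# sanity check: parameter count and a single forward pass on the randomly initialized model
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M parameters')  # should be far below 7000M

inputs = tokenizer('hello world', return_tensors='pt')
with torch.no_grad():
    out = model(**inputs)
print(out.logits.shape)  # (batch_size, sequence_length, vocab_size)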
note: a different model that meets the above conditions could also work, but I think LLaMA 2 is the ideal answer.
ref so: How to get a LLaMA v2 model with less than 7B parameters? (Stack Overflow, huggingface-transformers)
ref hf: How to get a LLaMA v2 model with less than 7B parameters? (Hugging Face forum)