Fine-tuning NeoX 20B: why is the resulting model so small?

Hello, I am fine-tuning NeoX 20B on a custom text corpus. It seems that when you use the train.py script, it produces a whole new model that is much smaller than the original 39 GB slimmed-weights model. I had "finetune" set to true in the YAML config, though the documentation suggests finetune is set to true automatically when training from a "release model", which I take to mean the slimmed-down 20B weights I downloaded.

For example, using the example small.yml, I trained for 5,000 iterations, and it output a checkpoint containing a roughly 2 GB model that can be loaded all by itself and used to generate outputs:

2223080 ./global_step5000

I was able to generate inference results using this checkpoint.

Now, my understanding of fine-tuning was that we would be training only the output "head" while leaving the remainder of the model unchanged. If that's the case, how is the resulting model only 2 GB? Shouldn't it be whatever size the non-head portion of the original model was (maybe 30 GB?), plus the new head, which would be much smaller? I don't see how the whole new model fits into 2 GB. I've put my back-of-the-envelope size check below.
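
Here is the rough arithmetic I am working from, as a sanity check. The bytes-per-parameter figures are my own assumptions (fp16 for stored weights; fp16 weights plus fp32 master weights and Adam moments for a training checkpoint), not values taken from the NeoX docs:

GB = 1024 ** 3

def gb(n_bytes: float) -> str:
    return f"{n_bytes / GB:.1f} GB"

# Assumed storage costs: 2 bytes/param for fp16 weights; ~14 bytes/param
# for a training checkpoint that also carries fp32 master weights plus
# Adam first and second moments (2 + 4 + 4 + 4). These are my guesses.
print("20B weights, fp16:        ", gb(20e9 * 2))    # ~37 GB, close to the 39 GB slim weights
print("160M weights, fp16:       ", gb(160e6 * 2))   # ~0.3 GB
print("160M training checkpoint: ", gb(160e6 * 14))  # ~2.1 GB, close to my ~2 GB checkpoint

If those assumptions are anywhere near right, 2 GB seems far too small to hold even the fp16 weights of a 20B model, which is what confuses me.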

Does that 2 GB model contain a slimmed-down representation of the original 39 GB model? I don't think so; it seems to be making a whole new model. There was no reference to the original model in any of the config files, so I think it has nothing to work with at inference time other than those 2 GB of weights.

My theories are that either I am doing it wrong (very possible), or the "fine-tuning" process occurring here differs from my understanding.

Here are some of my configs that may be relevant:

"data-path": "/home/ubuntu/data/train_text_document",
"vocab-file": "data/gpt2-vocab.json",
"merge-file": "data/gpt2-merges.txt",
"finetune": true,
"save": "checkpoints",
"load": "checkpoints",
"checkpoint_validation_with_forward_pass": false,
"pipe-parallel-size": 4,
"model-parallel-size": 2,

# model settings

"num-layers": 12,
"hidden-size": 768,
"num-attention-heads": 12,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,

Thanks for any help!

I am also wondering roughly how large a model the "large.yml" settings (below) produce. The documentation indicated that the "small.yml" settings yield a model of about 160M parameters; I'm not sure how that calculation is done, or what the equivalent number would be for "large.yml". I've sketched my guess at the calculation after the config.

# model settings

"num-layers": 24,
"hidden-size": 1536,
"num-attention-heads": 16,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,
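
For what it's worth, here is how I understand the rough parameter count for a GPT-style decoder: about 12 * L * h^2 for the transformer blocks, plus a V * h matrix per embedding, counted twice here since "no-weight-tying" is true, and nothing for learned position embeddings since "pos-emb" is rotary. The padded vocab size of 50,304 is my assumption based on the GPT-2 vocab/merge files:

# Rough GPT-style parameter count: ~12 * L * h^2 for the transformer
# blocks (attention + MLP), plus a V * h matrix each for the input and
# output embeddings (two matrices, because "no-weight-tying" is true).
# Rotary position embeddings add no learned parameters. V = 50304 is an
# assumed padded GPT-2 vocab size, not a value from the docs.
def approx_params(num_layers: int, hidden_size: int, vocab_size: int = 50304) -> int:
    blocks = 12 * num_layers * hidden_size ** 2
    embeddings = 2 * vocab_size * hidden_size
    return blocks + embeddings

print(f"small.yml: ~{approx_params(12, 768) / 1e6:.0f}M params")   # ~162M, near the documented ~160M
print(f"large.yml: ~{approx_params(24, 1536) / 1e6:.0f}M params")  # ~834M

Since the small.yml estimate lands near the documented 160M, I am guessing large.yml comes out somewhere around 800M parameters, but I'd appreciate confirmation that I have the calculation right.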