Fine-tuning NeoX 20B: why is the resulting model so small?

Hello, I am fine-tuning NeoX 20B on a custom text corpus. It seems that when you use the train.py script, it produces a whole new model that is much smaller than the original 39 GB slimmed-weights model. I had "finetune" set to true in the YAML config, though the documentation suggests finetune is set to true automatically when training from a "release model", which I take to mean the slimmed-down 20B weights I downloaded.

For example, using the example small.yml, I trained for 5,000 iterations, and it output a checkpoint containing a roughly 2 GB model that can be loaded all by itself and used to generate outputs:

2223080 ./global_step5000

I was able to generate inference results using this checkpoint.

Now, my understanding of fine-tuning was that we would be training only the output "head" while leaving the remainder of the model unchanged. If that's the case, how is the resulting model only 2 GB? Shouldn't it be whatever size the non-head portion of the original model was (maybe 30 GB?), plus the new head, which would be much smaller? I don't see how the whole new model fits into 2 GB. I've put my back-of-the-envelope size check below.
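
Here is the rough arithmetic I am working from, as a sanity check. The bytes-per-parameter figures are my own assumptions (fp16 for stored weights; fp16 weights plus fp32 master weights and Adam moments for a training checkpoint), not values taken from the NeoX docs:

GB = 1024 ** 3

def gb(n_bytes: float) -> str:
    return f"{n_bytes / GB:.1f} GB"

# Assumed storage costs: 2 bytes/param for fp16 weights; ~14 bytes/param
# for a training checkpoint that also carries fp32 master weights plus
# Adam first and second moments (2 + 4 + 4 + 4). These are my guesses.
print("20B weights, fp16:        ", gb(20e9 * 2))    # ~37 GB, close to the 39 GB slim weights
print("160M weights, fp16:       ", gb(160e6 * 2))   # ~0.3 GB
print("160M training checkpoint: ", gb(160e6 * 14))  # ~2.1 GB, close to my ~2 GB checkpoint

If those assumptions are anywhere near right, 2 GB seems far too small to hold even the fp16 weights of a 20B model, which is what confuses me.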

Does that 2 GB model contain a slimmed-down representation of the original 39 GB model? I don't think so; it seems to be making a whole new model. There was no reference to the original model in any of the config files, so I think it has nothing to work with at inference time other than those 2 GB of weights.

My theories are that either I am doing it wrong (very possible), or the "fine-tuning" process occurring here differs from my understanding.

Here are some of my configs that may be relevant:

"data-path": "/home/ubuntu/data/train_text_document",
"vocab-file": "data/gpt2-vocab.json",
"merge-file": "data/gpt2-merges.txt",
"finetune": true,
"save": "checkpoints",
"load": "checkpoints",
"checkpoint_validation_with_forward_pass": false,
"pipe-parallel-size": 4,
"model-parallel-size": 2,

# model settings

"num-layers": 12,
"hidden-size": 768,
"num-attention-heads": 12,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,

Thanks for any help!

I am also wondering roughly how large a model the "large.yml" settings (below) produce. The documentation indicated that the "small.yml" settings yield a model of about 160M parameters; I'm not sure how that calculation is done, or what the equivalent number would be for "large.yml". I've sketched my guess at the calculation after the config.

# model settings

"num-layers": 24,
"hidden-size": 1536,
"num-attention-heads": 16,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,
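
For what it's worth, here is how I understand the rough parameter count for a GPT-style decoder: about 12 * L * h^2 for the transformer blocks, plus a V * h matrix per embedding, counted twice here since "no-weight-tying" is true, and nothing for learned position embeddings since "pos-emb" is rotary. The padded vocab size of 50,304 is my assumption based on the GPT-2 vocab/merge files:

# Rough GPT-style parameter count: ~12 * L * h^2 for the transformer
# blocks (attention + MLP), plus a V * h matrix each for the input and
# output embeddings (two matrices, because "no-weight-tying" is true).
# Rotary position embeddings add no learned parameters. V = 50304 is an
# assumed padded GPT-2 vocab size, not a value from the docs.
def approx_params(num_layers: int, hidden_size: int, vocab_size: int = 50304) -> int:
    blocks = 12 * num_layers * hidden_size ** 2
    embeddings = 2 * vocab_size * hidden_size
    return blocks + embeddings

print(f"small.yml: ~{approx_params(12, 768) / 1e6:.0f}M params")   # ~162M, near the documented ~160M
print(f"large.yml: ~{approx_params(24, 1536) / 1e6:.0f}M params")  # ~834M

Since the small.yml estimate lands near the documented 160M, I am guessing large.yml comes out somewhere around 800M parameters, but I'd appreciate confirmation that I have the calculation right.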