I guess you need to use SFTTrainer or some other approach to train an LLM from scratch. Updating the tokenizer alone doesn't change the LLM. Try measuring how many out-of-distribution (OOD) tokens you get; the pre-trained model might still work well if you update both the tokenizer and the LLM's embedding layer.
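One rough way to gauge how out-of-domain your text is for the current vocabulary is the fraction of tokens that fall back to the unknown token. A minimal sketch (the helper name and the stub usage below are my own; any Hugging Face tokenizer exposing `encode` and an `unk_token_id` should slot in the same way):

```python
def unk_fraction(tokenizer, texts, unk_id):
    """Fraction of encoded tokens equal to the unknown-token id.

    A high value suggests the vocabulary covers your domain poorly,
    so updating the tokenizer (and the embedding layer) may help.
    """
    total = unk = 0
    for text in texts:
        ids = tokenizer.encode(text)
        total += len(ids)
        unk += sum(1 for i in ids if i == unk_id)
    return unk / max(total, 1)


# With a real tokenizer this would look like (not run here):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# rate = unk_fraction(tok, my_corpus_lines, tok.unk_token_id)
```

One caveat: byte-fallback tokenizers like Llama 2's rarely emit `<unk>` at all; for those, tokens-per-word on your corpus versus a general corpus is a better proxy for domain mismatch.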
The reference you use shows quite well how to fine-tune Llama 2 on the mlabonne/guanaco-llama2-1k data; in your case you would use your own data instead. It means the authors don't rely on the out-of-the-box Llama 2 model and decided to fine-tune on custom data. I don't see any mention of unsupervised training there. The code you posted here is actually generation/inference code.
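If you swap in your own data, it needs to match the prompt format the trainer reads from its text column. The guanaco-llama2-1k examples use the Llama 2 `[INST]` wrapping; a minimal sketch of converting your own records into that shape (the function names and record field names are assumptions, adjust to your schema):

```python
def format_llama2_example(instruction: str, response: str) -> str:
    # Wrap an (instruction, response) pair in the Llama 2 chat
    # template used by mlabonne/guanaco-llama2-1k.
    return f"<s>[INST] {instruction} [/INST] {response} </s>"


def build_text_column(records):
    # Produce the single "text" column that SFTTrainer reads
    # via dataset_text_field="text".
    return [
        {"text": format_llama2_example(r["instruction"], r["response"])}
        for r in records
    ]
```

You would then build a `datasets.Dataset` from these rows and pass it to SFTTrainer in place of the guanaco dataset.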
The subsection title “Unsupervised Pre-training with Base Model Llama 2” seems a little misleading. All the code in that subsection does is test model generation.
It has sample code you can follow, but note that the guide doesn't use PEFT methods (which basically means training will be more computationally expensive). You can add the code for that later if you want.
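If you do add PEFT later, a LoRA configuration passed to SFTTrainer is the usual route. A sketch with illustrative hyperparameters (the values are assumptions, not from the guide):

```python
from peft import LoraConfig

# Illustrative LoRA settings; tune r / lora_alpha / dropout for your data.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Then: SFTTrainer(..., peft_config=peft_config) trains small adapter
# matrices instead of all model weights, cutting memory and compute.
```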