Is this the correct way to perform unsupervised training for an LLM?

I’m new to Hugging Face and to training LLMs in general. Please bear with me.

I’m following this tutorial for training a model (NousResearch/Llama-2-7b-chat-hf): A Comprehensive Guide to LLM Training

According to the article, after loading the model and tokenizer, you do unsupervised training by doing this:

model_inputs = unsupervised_tokenizer(training_data, return_tensors="pt", padding=True).to("cuda")  # tokenize the prompts and move them to the GPU
generated_ids = unsupervised_model.generate(**model_inputs, max_new_tokens=50)  # generate up to 50 new tokens per prompt
unsupervised_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)  # decode the generated token IDs back into text

I followed the steps for unsupervised training, but when I created a pipeline:

unsupervised_pipe = pipeline(task="text-generation", model=unsupervised_model, tokenizer=unsupervised_tokenizer, max_length=50)

and passed a question to it, the model created its own response and did not base its answer on the training_data.

Is this actually the correct way of training it? Or do I have to use something like SFTTrainer?


I guess you need to use SFTTrainer or some other method to actually train the LLM. Updating the tokenizer alone doesn’t change the LLM. Try to see how many out-of-distribution (OOD) tokens you get. The pre-trained model might still perform well after updating the tokenizer and the LLM’s embedding layer.
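For the last point, a rough sketch of what adding tokens and resizing the embedding layer could look like (the new tokens below are just placeholders, not from the tutorial):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # 7B model, needs enough GPU/CPU memory

# Add domain-specific tokens (illustrative placeholders only)
new_tokens = ["<domain_term_1>", "<domain_term_2>"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding layer so the model has rows for the new tokens
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

The new embedding rows are randomly initialized, so the model still needs fine-tuning before those tokens mean anything to it.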

Thank you for your input.

Do you have any resources that I could use as reference for unsupervised training?

The reference you use shows pretty well how to fine-tune Llama 2 on the mlabonne/guanaco-llama2-1k data; you would just use your own data there instead. It means the authors didn’t rely on the out-of-the-box Llama 2 model and decided to fine-tune it on custom data. I don’t see any mention of unsupervised training there. The code you posted here is actually generation/inference code.
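If you want to fine-tune on your own data the way the tutorial does, a minimal sketch with trl’s SFTTrainer could look like this. Argument names vary between trl versions (in newer releases dataset_text_field and tokenizer move into SFTConfig), so treat this as an outline rather than the tutorial’s exact code:

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# The instruction dataset used in the tutorial; swap in your own dataset here
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

training_args = TrainingArguments(
    output_dir="./llama2-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model=unsupervised_model,          # the already-loaded Llama 2 model
    tokenizer=unsupervised_tokenizer,  # the already-loaded tokenizer
    train_dataset=dataset,
    dataset_text_field="text",         # column containing the prompt/response text
    max_seq_length=512,
    args=training_args,
)

trainer.train()

After training, rebuilding your text-generation pipeline with the fine-tuned model should give answers that reflect the custom data.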


The subsection title “Unsupervised Pre-training with Base Model Llama 2” seems a little misleading. All that the code in that subsection does is test model generation.

If you’re looking for a way to perform unsupervised training specifically, take a look at this guide: How to Train BERT for Masked Language Modeling Tasks

It has sample code you can follow, but please note that the guide doesn’t make use of PEFT methods (which basically means that training will be more computationally expensive). You can add the code for that later if you want.
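For reference, a minimal sketch of the masked-language-modeling setup that guide walks through might look like the following; bert-base-uncased and my_corpus.txt are placeholder choices, not taken from the guide itself:

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative corpus; replace my_corpus.txt with your own text file
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens; the model is trained to predict them
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./bert-mlm", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)

trainer.train()

This is the “unsupervised” objective in the usual sense: the labels come from the text itself (the masked tokens), not from question/answer pairs.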