Is this the correct way to perform unsupervised training for an LLM?

I’m new to Hugging Face and to training LLMs in general. Please bear with me.

I’m following this tutorial for training a model (NousResearch/Llama-2-7b-chat-hf): A Comprehensive Guide to LLM Training

According to the article, after loading the model and tokenizer, you do unsupervised training by doing this:

model_inputs = unsupervised_tokenizer(training_data, return_tensors="pt", padding=True).to("cuda")  # tokenize the prompts and move them to the GPU
generated_ids = unsupervised_model.generate(**model_inputs, max_new_tokens=50)  # generate up to 50 new tokens per prompt
unsupervised_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)  # decode the generated token IDs back into text

I followed the steps for unsupervised training, but when I created a pipeline:

unsupervised_pipe = pipeline(task="text-generation", model=unsupervised_model, tokenizer=unsupervised_tokenizer, max_length=50)

and passed a question to it, the model created its own response and did not base its answer on the training_data.

Is this actually the correct way of training it? Or do I have to use something like SFTTrainer?


I guess you need to use SFTTrainer or some other method to actually train the LLM. Updating the tokenizer alone doesn’t change the LLM. Try to see how many out-of-distribution (OOD) tokens you get. The pre-trained model might still perform well after updating the tokenizer and the LLM’s embedding layer.
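For the last point, a rough sketch of what adding tokens and resizing the embedding layer could look like (the new tokens below are just placeholders, not from the tutorial):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # 7B model, needs enough GPU/CPU memory

# Add domain-specific tokens (illustrative placeholders only)
new_tokens = ["<domain_term_1>", "<domain_term_2>"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding layer so the model has rows for the new tokens
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

The new embedding rows are randomly initialized, so the model still needs fine-tuning before those tokens mean anything to it.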

Thank you for your input.

Do you have any resources that I could use as reference for unsupervised training?

The reference you use shows pretty well how to fine-tune Llama 2 on the mlabonne/guanaco-llama2-1k data; you would just use your own data there instead. It means the authors didn’t rely on the out-of-the-box Llama 2 model and decided to fine-tune it on custom data. I don’t see any mention of unsupervised training there. The code you posted here is actually generation/inference code.
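If you want to fine-tune on your own data the way the tutorial does, a minimal sketch with trl’s SFTTrainer could look like this. Argument names vary between trl versions (in newer releases dataset_text_field and tokenizer move into SFTConfig), so treat this as an outline rather than the tutorial’s exact code:

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# The instruction dataset used in the tutorial; swap in your own dataset here
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

training_args = TrainingArguments(
    output_dir="./llama2-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model=unsupervised_model,          # the already-loaded Llama 2 model
    tokenizer=unsupervised_tokenizer,  # the already-loaded tokenizer
    train_dataset=dataset,
    dataset_text_field="text",         # column containing the prompt/response text
    max_seq_length=512,
    args=training_args,
)

trainer.train()

After training, rebuilding your text-generation pipeline with the fine-tuned model should give answers that reflect the custom data.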


The subsection title “Unsupervised Pre-training with Base Model Llama 2” seems a little misleading. All that the code in that subsection does is test model generation.

If you’re looking for a way to perform unsupervised training specifically, take a look at this guide: How to Train BERT for Masked Language Modeling Tasks

It has sample code you can follow, but please note that the guide doesn’t make use of PEFT methods (which basically means that training will be more computationally expensive). You can add the code for that later if you want.
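For reference, a minimal sketch of the masked-language-modeling setup that guide walks through might look like the following; bert-base-uncased and my_corpus.txt are placeholder choices, not taken from the guide itself:

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative corpus; replace my_corpus.txt with your own text file
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens; the model is trained to predict them
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./bert-mlm", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)

trainer.train()

This is the “unsupervised” objective in the usual sense: the labels come from the text itself (the masked tokens), not from question/answer pairs.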