ALBERT pre-training convergence problem

My ALBERT model, pre-trained from scratch, can't get the training loss to converge toward 0, even on WikiText-2.

  • The training loss converged at 6.6 when using AlbertForMaskedLM as the model class
  • The training loss went negative when using AlbertForPretrain as the model class

Note: in the last run I deliberately set the eval dataset to be the same as the training set, in order to check the training loss.
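
For reference, here is a minimal sketch of the kind of setup I'm describing, assuming the Hugging Face Trainer API, the datasets library, and a pretrained ALBERT tokenizer; the hyperparameters, sequence length, and dataset split are placeholders, not my exact configuration:

```python
from datasets import load_dataset
from transformers import (
    AlbertConfig,
    AlbertForMaskedLM,
    AlbertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Reuse the pretrained ALBERT tokenizer; only the model weights are from scratch.
tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")

# WikiText-2 tokenized into truncated sequences (block grouping omitted for brevity).
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = raw.map(tokenize, batched=True, remove_columns=["text"])

# Randomly initialized ALBERT trained with the MLM objective only.
config = AlbertConfig(vocab_size=tokenizer.vocab_size)
model = AlbertForMaskedLM(config)

# Standard 15% masking for masked language modeling.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="albert-scratch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=train_ds,  # eval set deliberately identical to the training set
    data_collator=collator,
)
trainer.train()
trainer.evaluate()
```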
I also raised an issue here: