So far, I’ve been using pre-trained models. For my task, it seems I need to train a model on a GLUE task from scratch (i.e., without pre-trained weights) just to see how it performs. I wanted to confirm what modifications are needed to do this; I’m not sure about using the same tokenizer.
I want to randomly initialize the model and train it on a GLUE task. Additionally, if you have any tips on doing this effectively when not using pre-trained weights, please share.
You can initialize a model without pre-trained weights using
from transformers import BertConfig, BertForSequenceClassification
# either load pre-trained config
config = BertConfig.from_pretrained("bert-base-cased")
# or instantiate yourself
config = BertConfig(
    vocab_size=2048,
    max_position_embeddings=768,
    intermediate_size=2048,
    hidden_size=512,
    num_attention_heads=8,
    num_hidden_layers=6,
    type_vocab_size=5,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=3,
)
# pass the config to model constructor instead of from_pretrained
# this creates the model as per the params in config
# but with weights randomly initialized
model = BertForSequenceClassification(config)
and as it’s a ForSequenceClassification model, the existing run_glue.py script can be used to train this model; just initialize the model using the config instead of .from_pretrained.
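For example, here is a rough sketch of that swap as it might look inside run_glue.py; the surrounding script changes between versions, so treat the from_pretrained line in the comment as approximate rather than the exact script code:

from transformers import BertConfig, BertForSequenceClassification, BertTokenizerFast

# num_labels depends on the GLUE task (e.g. 3 for MNLI, 2 for SST-2)
config = BertConfig.from_pretrained("bert-base-cased", num_labels=3)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# run_glue.py normally builds the model roughly like this:
#   model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, config=config)
# replace that call with the config-only constructor so the weights stay randomly initialized:
model = BertForSequenceClassification(config)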
Thanks for replying. Okay, so the process is mostly the same. Do we need to make any changes to the tokenizer? I saw a few posts where people had encountered issues during pre-training, so I thought I’d confirm.
Also, any training tips for when we’re not using pre-trained weights?
If you train the tokenizer from scratch as well, then make sure to change the vocab size in the config accordingly.
I haven’t done this myself for this task so I can’t say much, but you should probably start with a higher LR than the Trainer default (which is 5e-5) since we are training from scratch, and experiment with the LR schedule; a hyperparameter search will definitely help choose the right params.
@sgugger might have better tips for this.
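For what it’s worth, a rough sketch of what that might look like with TrainingArguments; the specific values (1e-4 LR, warmup ratio, epoch count) are illustrative starting points to tune, not recommendations:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-from-scratch-glue",
    learning_rate=1e-4,                # higher than the 5e-5 Trainer default, since weights are random
    lr_scheduler_type="linear",        # worth also trying "cosine"
    warmup_ratio=0.06,                 # warmup tends to matter more without pre-trained weights
    num_train_epochs=10,               # from-scratch training usually needs more epochs than fine-tuning
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
    weight_decay=0.01,
)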
Is there any problem if we use AutoTokenizer.from_pretrained()? That’s where I am unsure. Or do we use a custom tokenizer? What’s the recommended way of proceeding?
You can use the pre-trained tokenizer; it shouldn’t cause any issues. And IMO using a pre-trained tokenizer makes more sense than training one from scratch on limited data.
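If it helps, a small sketch of reusing a pre-trained tokenizer while keeping the config consistent with it (only the relevant config fields are shown):

from transformers import AutoTokenizer, BertConfig, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# keep the config's vocab_size in sync with the tokenizer you actually use
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    num_labels=3,
)
model = BertForSequenceClassification(config)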
With this standard setup, BERT-base trains, but BERT-large doesn’t seem to respond at all. I seem to be missing something with regard to training dynamics.
Is there any exception or error?
Just to add to this, training your own tokenizer is mainly useful if you are working with a specific genre, domain, and/or language.
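In case someone does need a domain-specific tokenizer, here is a hedged sketch using train_new_from_iterator; the corpus and vocab size are placeholders:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# placeholder corpus: in practice, an iterator over your domain texts
domain_texts = ["example sentence from your domain", "another example sentence"]

new_tokenizer = old_tokenizer.train_new_from_iterator(domain_texts, vocab_size=30000)
new_tokenizer.save_pretrained("./domain-tokenizer")

# remember to set vocab_size=len(new_tokenizer) in the model config afterwards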
No, it doesn’t learn anything. I tried different LRs. Are there specific considerations when training these large models from scratch? BERT-base improves from 31 to 58, whereas BERT-large stays at 31.
More generally, what sort of performance metrics should one expect to see when pre-training?
To be more specific, how long should we pre-train (days vs weeks), and what’s an acceptable loss? When should we stop? I would appreciate any references on this issue, thank you!
Are you pre-training on GLUE tasks with a non-pretrained large model?
I don’t know for sure, but multi-task learning may help.
Additionally, since the dataset is much smaller than the large text corpora used for pre-training, stronger regularization (dropout, weight decay, gradient clipping, …) may help.
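A rough sketch of what stronger regularization could look like; the exact numbers are illustrative, not tuned:

from transformers import BertConfig, TrainingArguments

config = BertConfig(
    hidden_dropout_prob=0.2,               # bumped up from the usual 0.1
    attention_probs_dropout_prob=0.2,
    num_labels=3,
)

training_args = TrainingArguments(
    output_dir="./bert-from-scratch-glue",
    weight_decay=0.1,                      # stronger than the common 0.01
    max_grad_norm=1.0,                     # gradient clipping (also the Trainer default)
)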
Typically that means GLUE / SuperGLUE in English; fine-tuning on your target domain should also be fine for a model aimed at that domain, I think.
The Scaling Laws for Neural Language Models paper should help. In short, it depends on the size of the model and how much compute you can afford or how much performance you want. I remember Hugging Face has a calculator for this.
Hey @prajjwal1, were you able to resolve this?
I’ve opened a new issue which is about pre-training. The training-on-GLUE part is resolved. Thanks for asking.
Glad that it’s resolved. What kind of metrics are you getting?
After experimenting, I felt as if I had replicated the results of Table 1 from the Revealing Dark Secrets of BERT paper.
Take this with a grain of salt, but I heard that BERT-large can’t be trained without a TPU because it has too many parameters to fit into GPU memory.