Tips for pre-training BERT from scratch

So far, I’ve been using pre-trained models. For my task, it seems I need to train BERT on a GLUE task from scratch (without pre-training) just to see how it performs. I wanted to confirm what modifications are needed to do this. I’m also not sure whether to use the same tokenizer.

I want to randomly initialize the model and train it on a GLUE task. Additionally, if you have any tips on doing this effectively when not starting from pre-trained weights, please share.

1 Like

You can initialize a model without pre-trained weights using

from transformers import BertConfig, BertForSequenceClassification

# either load a pre-trained config
config = BertConfig.from_pretrained("bert-base-cased")
# or instantiate one yourself (the values below are just examples)
config = BertConfig(
    vocab_size=2048,
    max_position_embeddings=768,
    intermediate_size=2048,
    hidden_size=512,
    num_attention_heads=8,
    num_hidden_layers=6,
    type_vocab_size=5,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=3,
)

# pass the config to the model constructor instead of using from_pretrained;
# this builds the model according to the params in the config,
# but with randomly initialized weights
model = BertForSequenceClassification(config)

And as it’s a ForSequenceClassification model, the existing run_glue.py script can be used to train it; just initialize the model from the config instead of with .from_pretrained.
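For reference, here is a rough sketch of the same idea using the Trainer directly instead of run_glue.py. It assumes the datasets library, the bert-base-cased tokenizer/config, and MRPC as the GLUE task; the training hyperparameters are placeholders, not recommendations.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# reuse the pre-trained config (so vocab_size matches the tokenizer), but build
# the model from it directly, i.e. with randomly initialized weights
config = BertConfig.from_pretrained("bert-base-cased", num_labels=2)
model = BertForSequenceClassification(config)

raw = load_dataset("glue", "mrpc")

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True, max_length=128)

encoded = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-scratch-mrpc",
    learning_rate=1e-4,              # from-scratch training usually wants a higher LR
    num_train_epochs=10,
    per_device_train_batch_size=32,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,             # enables dynamic padding via the default collator
)
trainer.train()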

3 Likes

Thanks for replying. Okay, so the process is mostly the same. Do we need to make any changes to the tokenizer? I saw a few posts where people ran into issues during pre-training, so I wanted to confirm.

Also, any training tips when not using pre-trained weights?

If you train the tokenizer from scratch as well, then make sure to change the vocab size in the config accordingly.
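For example (a sketch using the tokenizers library; my_corpus.txt and the vocab size are placeholders):

import os
from tokenizers import BertWordPieceTokenizer
from transformers import BertConfig

# train a WordPiece tokenizer on your own corpus
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["my_corpus.txt"], vocab_size=8000)

os.makedirs("my-tokenizer", exist_ok=True)
tokenizer.save_model("my-tokenizer")   # writes vocab.txt

# keep the model config's vocab size in sync with the tokenizer you just trained
config = BertConfig(vocab_size=tokenizer.get_vocab_size())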

I haven’t done this myself for this task, so I can’t say much, but probably start with a higher LR than the Trainer default (5e-5) since we are training from scratch, and experiment with the LR schedule; a hyperparameter search will definitely help choose the right params.
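If you go the hyperparameter-search route, the Trainer has hyperparameter_search built in. Here is a rough sketch with the Optuna backend (pip install optuna), reusing the config, args, tokenizer, and encoded dataset from the sketch above; the search space is just an assumption.

from transformers import BertForSequenceClassification, Trainer

def model_init():
    # fresh random weights for every trial
    return BertForSequenceClassification(config)

def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        "lr_scheduler_type": trial.suggest_categorical(
            "lr_scheduler_type", ["linear", "cosine", "constant_with_warmup"]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 3, 10),
    }

trainer = Trainer(
    model_init=model_init,           # needed so each trial starts from scratch
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
best_run = trainer.hyperparameter_search(hp_space=hp_space, backend="optuna", n_trials=20)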

@sgugger might have better tips for this.

1 Like

Is there any problem if we use AutoTokenizer.from_pretrained()? That’s where I am unsure. Or do we use a custom tokenizer? What’s the recommended way of proceeding?

You can use a pre-trained tokenizer; it shouldn’t cause any issues. And IMO, using a pre-trained tokenizer makes more sense than training one from scratch on limited data.

1 Like

With the standard setup, BERT-base trains, but BERT-large doesn’t seem to respond. I seem to be missing something with regard to the training dynamics.

Is there any exception or error?

Just to add to this, training your own tokenizer is mainly useful if you are working with a specific genre, domain, and/or language.

1 Like

No, it doesn’t learn anything. I tried different LRs. Are there specific things to watch for when training these large models from scratch? BERT-base improves from 31 to 58, whereas BERT-large stays at 31.

More generally, what sort of performance metrics should one expect to see when pre-training?

To be more specific, how long should we pre-train (days vs weeks), and what’s an acceptable loss? When should we stop? I would appreciate any references on this issue, thank you!

Pre-training on GLUE tasks with a non-pretrained large model?

I don’t know for sure, but multi-task learning may help.
Additionally, since the dataset is much smaller than the large text corpora used for pre-training, stronger regularization (dropout, weight decay, gradient clipping, …) may help.
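For example, the usual knobs look something like this (a sketch; the specific values are illustrative assumptions, not recommendations):

from transformers import BertConfig, BertForSequenceClassification, TrainingArguments

# dropout lives in the model config
config = BertConfig.from_pretrained(
    "bert-base-cased",
    hidden_dropout_prob=0.2,             # default is 0.1
    attention_probs_dropout_prob=0.2,    # default is 0.1
)
model = BertForSequenceClassification(config)

# weight decay and gradient clipping live in the TrainingArguments
args = TrainingArguments(
    output_dir="bert-scratch-regularized",
    weight_decay=0.1,       # default is 0.0
    max_grad_norm=1.0,      # gradient clipping (1.0 is already the default)
)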

Typically GLUE / SuperGLUE in English; or fine-tuning on your target domain should also be fine for a model aimed at that domain, I think.

Scaling Laws for Neural Language Models should help. In short, it depends on the size of the model and how much compute cost you can afford, or how much performance you want. I remember Hugging Face has a calculator for this.

Hey @prajjwal1, were you able to resolve this?

I’ve opened a new issue about pre-training. The training-on-GLUE part is resolved. Thanks for asking.

Glad that it’s resolved. What kind of metrics are you getting?

After experimenting, I felt as if I had replicated the results of Table 1 from the Revealing Dark Secrets of BERT paper.

Take this with a grain of salt, but I heard that BERT-large can’t be trained without a TPU because it has too many parameters to fit into GPU memory.