So far, I’ve been using pre-trained models. For my task, it seems I need to train a model on a GLUE task from scratch (i.e., without pre-trained weights) just to see how it performs. I wanted to confirm what modifications are needed to do this; I’m not sure about using the same tokenizer.
I want to randomly initialize the model and train it on a GLUE task. Additionally, if you have any tips on doing this effectively when not using pre-trained weights, please share.
You can initialize a model without pre-trained weights using
from transformers import BertConfig, BertForSequenceClassification
# either load pre-trained config
config = BertConfig.from_pretrained("bert-base-cased")
# or instantiate yourself
config = BertConfig(
    vocab_size=2048,
    max_position_embeddings=768,
    intermediate_size=2048,
    hidden_size=512,
    num_attention_heads=8,
    num_hidden_layers=6,
    type_vocab_size=5,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=3,
)
# pass the config to model constructor instead of from_pretrained
# this creates the model as per the params in config
# but with weights randomly initialized
model = BertForSequenceClassification(config)
and as it’s a ForSequenceClassification model, the existing run_glue.py script can be used to train this model; just initialize the model using the config instead of .from_pretrained.
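For example, here is a rough sketch of that swap as it might look inside run_glue.py; the surrounding script changes between versions, so treat the from_pretrained line in the comment as approximate rather than the exact script code:

from transformers import BertConfig, BertForSequenceClassification, BertTokenizerFast

# num_labels depends on the GLUE task (e.g. 3 for MNLI, 2 for SST-2)
config = BertConfig.from_pretrained("bert-base-cased", num_labels=3)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# run_glue.py normally builds the model roughly like this:
#   model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, config=config)
# replace that call with the config-only constructor so the weights stay randomly initialized:
model = BertForSequenceClassification(config)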
Thanks for replying. Okay, so the process is mostly the same. Do we need to make any changes to the tokenizer? I saw a few posts where people had encountered issues during pre-training, so I thought I’d confirm.
Also, any training tips for when we’re not using pre-trained weights?
If you train the tokenizer from scratch as well, then make sure to change the vocab size in the config accordingly.
I haven’t done this myself for this task so I can’t say much, but you should probably start with a higher LR than the Trainer default (which is 5e-5) since we are training from scratch, and experiment with the LR schedule; a hyperparameter search will definitely help choose the right params.
@sgugger might have better tips for this.
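For what it’s worth, a rough sketch of what that might look like with TrainingArguments; the specific values (1e-4 LR, warmup ratio, epoch count) are illustrative starting points to tune, not recommendations:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-from-scratch-glue",
    learning_rate=1e-4,                # higher than the 5e-5 Trainer default, since weights are random
    lr_scheduler_type="linear",        # worth also trying "cosine"
    warmup_ratio=0.06,                 # warmup tends to matter more without pre-trained weights
    num_train_epochs=10,               # from-scratch training usually needs more epochs than fine-tuning
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
    weight_decay=0.01,
)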
Is there any problem if we use AutoTokenizer.from_pretrained()? That’s where I am unsure. Or do we use a custom tokenizer? What’s the recommended way of proceeding?
You can use the pre-trained tokenizer; it shouldn’t cause any issues. And IMO using a pre-trained tokenizer makes more sense than training one from scratch on limited data.
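If it helps, a small sketch of reusing a pre-trained tokenizer while keeping the config consistent with it (only the relevant config fields are shown):

from transformers import AutoTokenizer, BertConfig, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# keep the config's vocab_size in sync with the tokenizer you actually use
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    num_labels=3,
)
model = BertForSequenceClassification(config)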
With this standard setup, BERT-base trains, but BERT-large doesn’t seem to respond at all. I seem to be missing something with regard to training dynamics.
Is there any exception or error?
Just to add to this, training your own tokenizer is mainly useful if you are working with a specific genre, domain, and/or language.
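In case someone does need a domain-specific tokenizer, here is a hedged sketch using train_new_from_iterator; the corpus and vocab size are placeholders:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# placeholder corpus: in practice, an iterator over your domain texts
domain_texts = ["example sentence from your domain", "another example sentence"]

new_tokenizer = old_tokenizer.train_new_from_iterator(domain_texts, vocab_size=30000)
new_tokenizer.save_pretrained("./domain-tokenizer")

# remember to set vocab_size=len(new_tokenizer) in the model config afterwards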
No, it doesn’t learn anything. I tried different LRs. Are there specific considerations when training these large models from scratch? BERT-base improves from 31 to 58, whereas BERT-large stays at 31.
More generally, what sort of performance metrics should one expect to see when pre-training?
To be more specific, how long should we pre-train (days vs weeks), and what’s an acceptable loss? When should we stop? I would appreciate any references on this issue, thank you!
Are you pre-training on GLUE tasks with a non-pretrained large model?
I don’t know for sure, but multi-task learning may help.
Additionally, since the dataset is much smaller than the large text corpora used for pre-training, stronger regularization (dropout, weight decay, gradient clipping, …) may help.
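A rough sketch of what stronger regularization could look like; the exact numbers are illustrative, not tuned:

from transformers import BertConfig, TrainingArguments

config = BertConfig(
    hidden_dropout_prob=0.2,               # bumped up from the usual 0.1
    attention_probs_dropout_prob=0.2,
    num_labels=3,
)

training_args = TrainingArguments(
    output_dir="./bert-from-scratch-glue",
    weight_decay=0.1,                      # stronger than the common 0.01
    max_grad_norm=1.0,                     # gradient clipping (also the Trainer default)
)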
Typically that means GLUE / SuperGLUE in English; fine-tuning on your target domain should also be fine for a model aimed at that domain, I think.
The Scaling Laws for Neural Language Models paper should help. In short, it depends on the size of the model and how much compute you can afford or how much performance you want. I remember Hugging Face has a calculator for this.
Hey @prajjwal1, were you able to resolve this?
I’ve opened a new issue which is about pre-training. The training-on-GLUE part is resolved. Thanks for asking.
Glad that it’s resolved. What kind of metrics are you getting?
After experimenting, I felt as if I had replicated the results of Table 1 from the Revealing Dark Secrets of BERT paper.
Take this with a grain of salt, but I heard that BERT-large can’t be trained without a TPU because it has too many parameters to fit into GPU memory.