I am training a BERT model from scratch on my own corpus, following this blog post: https://huggingface.co/blog/how-to-train
```python
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

# Define the config and initialize a fresh (untrained) model
config = BertConfig(
    vocab_size=32000,
    max_position_embeddings=1024,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=2,
    hidden_act="gelu",
    intermediate_size=3072,
    hidden_dropout_prob=0.1,
    hidden_size=768,
    initializer_range=0.02,
    attention_probs_dropout_prob=0.1,
)
model = BertForMaskedLM(config=config)

training_args = TrainingArguments(
    output_dir="./bert",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    save_steps=1000,
    save_total_limit=2,
    do_train=True,
    do_eval=True,
    logging_steps=1000,
    eval_steps=None,
    prediction_loss_only=True,
)

# data_collator, train_dataset, and eval_dataset are built earlier,
# following the blog post
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()
```
Now that training has finished, I have a few questions, and I would appreciate it if anyone could shed some light on them:
- How can I evaluate the performance of a model trained this way, without labels? Is there a way to see the validation loss or a perplexity score? At the moment, the only relevant option I can find is `do_eval` in the training arguments, but I'm not sure what the next steps are to actually get the model's performance. My train set is 95% of the corpus and my validation set is the remaining 5%.
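For concreteness, here is a minimal sketch of what I imagine the evaluation step looks like, assuming `trainer.evaluate()` on the validation split returns an `eval_loss` entry and that perplexity for a masked language model is just the exponential of that loss:

```python
import math

# Run evaluation on the 5% validation split passed as eval_dataset
eval_results = trainer.evaluate()

# For a masked language model, perplexity is the exponential
# of the (cross-entropy) evaluation loss
eval_loss = eval_results["eval_loss"]
print(f"Validation loss: {eval_loss:.4f}")
print(f"Perplexity: {math.exp(eval_loss):.2f}")
```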
For now, I have trained the model for 5 epochs. Hypothetically, if I want to train it for 5 more epochs after the first 5 are done, what would be the best way to continue training without starting from scratch? Do I just run `trainer.train()` again, something like the sketch below?
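Here is a rough sketch of the two approaches I can think of, reusing `training_args`, `data_collator`, and the datasets from above; the checkpoint directory name is hypothetical, and I'm assuming a recent transformers version where `Trainer.train` accepts `resume_from_checkpoint`:

```python
# Option A: pick up from the latest checkpoint saved in output_dir
# (also restores the optimizer and learning-rate scheduler state)
trainer.train(resume_from_checkpoint=True)

# Option B: reload the saved weights and launch a new 5-epoch run
# ("./bert/checkpoint-1000" is a hypothetical checkpoint directory;
# note this restarts the optimizer and learning-rate schedule)
model = BertForMaskedLM.from_pretrained("./bert/checkpoint-1000")
trainer = Trainer(
    model=model,
    args=training_args,  # num_train_epochs=5 again
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```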
My corpus is only about 1.5 GB - what would be an ideal number of epochs to train for?
Apologies for the multiple questions, but I'm quite new to deep learning and language models, so I feel like I'm missing the big picture here. Many thanks in advance for your insights!