I am training a BERT model from scratch on my own corpus, following this blog post: https://huggingface.co/blog/how-to-train
```python
# Define the config for the model
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

config = BertConfig(
    vocab_size=32000,
    max_position_embeddings=1024,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=2,
    hidden_act="gelu",
    intermediate_size=3072,
    hidden_dropout_prob=0.1,
    hidden_size=768,
    initializer_range=0.02,
    attention_probs_dropout_prob=0.1,
)

model = BertForMaskedLM(config=config)

training_args = TrainingArguments(
    output_dir="./bert",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    save_steps=1000,
    save_total_limit=2,
    do_train=True,
    do_eval=True,
    logging_steps=1000,
    eval_steps=None,
    prediction_loss_only=True,
)

# data_collator, train_dataset and eval_dataset are defined earlier
# in my script, following the blog post
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()
```
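For completeness, `data_collator`, `train_dataset` and `eval_dataset` come from earlier in my script, roughly following the blog post (the tokenizer directory, file paths and `block_size` below are placeholders for my actual setup):

```python
from transformers import (
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
)

# Tokenizer trained on the same corpus (path is a placeholder)
tokenizer = BertTokenizerFast.from_pretrained("./bert-tokenizer")

# 95% / 5% split of the corpus into train and validation files
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="./train.txt", block_size=512
)
eval_dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="./eval.txt", block_size=512
)

# Masked-language-modelling collator: randomly masks 15% of the tokens
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```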
Now that training has finished, I have a few questions; I'd be grateful if anyone could shed some light on them:
- How can I evaluate the performance of my unsupervised trained model? Is there a way to see the validation loss or a perplexity score? At the moment, this is the only output I see:
  `TrainOutput(global_step=804310, training_loss=2.2400301857170966)`
  I set `eval_dataset` and `do_eval` in the training arguments, but I'm not sure what the next steps are to obtain the model's performance (my current guess is sketched after this list). My train set is 95% of the corpus and the validation set is the remaining 5%.
- For now, I have trained the model for 5 epochs. Hypothetically, if I want to keep training the model for 5 more epochs (after the first 5 are done), what would be the best way to continue training without having to start from scratch again? Do I just run `trainer.train()` again (see the second sketch below)?
- My corpus is only about 1.5 GB; what would be an ideal number of epochs to train for?
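For the first question, my guess is that the snippet below is the intended way to get a validation loss once training has finished, with perplexity taken as the exponential of that loss; I'm not sure whether this is correct, or whether additional training arguments are needed for evaluation to also run during training:

```python
import math

# Full pass over eval_dataset with the trained model
eval_output = trainer.evaluate()

# For a masked LM, perplexity is conventionally reported as exp(loss)
eval_loss = eval_output["eval_loss"]
print(f"Validation loss: {eval_loss:.4f}")
print(f"Perplexity: {math.exp(eval_loss):.2f}")
```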
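For the second question, this is roughly what I had in mind for picking up from the most recent checkpoint saved under `output_dir`, though I understand the exact keyword may depend on the transformers version:

```python
# Raise the target epoch count from 5 to 10, then resume from the
# latest checkpoint in output_dir instead of reinitializing the model
trainer.args.num_train_epochs = 10
trainer.train(resume_from_checkpoint=True)
```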
Apologies for the multiple questions, but I'm quite new to deep learning and language models, so I feel like I'm missing the big picture here. Many thanks in advance for your insights!