Opinion: Training Arguments for Fine-Tuning RoBERTa for MLM

Hi,
I'd like to ask for an opinion on the training arguments I have set to fine-tune a pre-trained RoBERTa model for MLM (fill-mask) use.
I want to know whether these arguments are correct for my case (MLM) or whether there is something I could do better.

This is my code:

import torch
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,                 # enable masked-language-modeling masking
    mlm_probability=0.15      # select 15% of tokens for the MLM objective
)

training_args = TrainingArguments(
    output_dir="./results",
    save_strategy="steps",
    eval_strategy="steps",
    save_steps=500,
    eval_steps=500,
    learning_rate=2e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    num_train_epochs=3,
    weight_decay=0.01,
    max_grad_norm=1.0,
    logging_dir="./logs",
    fp16=True,
    gradient_accumulation_steps=4,
    eval_accumulation_steps=24,
    logging_steps=100,
    warmup_steps=1000,
    save_total_limit=2,
    greater_is_better=False,
    load_best_model_at_end=True,
    overwrite_output_dir=True,
    optim="adamw_torch"
)

class MyTrainer(Trainer):
    def training_step(self, model, inputs, *args, **kwargs):
        # Trainer.training_step does not take an optimizer argument;
        # pass any extra arguments through unchanged.
        loss = super().training_step(model, inputs, *args, **kwargs)
        # Free cached GPU memory after each step (adds some overhead).
        torch.cuda.empty_cache()
        return loss


trainer = MyTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_test,
    data_collator=data_collator,
    tokenizer=tokenizer
)

I'm using a pre-trained RoBERTa model with a vocabulary of 52,000 tokens and a dataset of 200,000 rows (80% train, 20% test).

Thank you


Hi, @Cicciokr!
Your training arguments look well thought-out for fine-tuning RoBERTa on a Masked Language Modeling (MLM) task, but there are a few areas that could be adjusted or optimized for better performance. I’ll go through them one by one.

1. Data Collator:

  • The DataCollatorForLanguageModeling with mlm=True and mlm_probability=0.15 is the standard setup for MLM: it randomly selects 15% of the tokens in each batch for the masking objective. You can experiment with slightly higher or lower values of mlm_probability depending on your results, but 0.15 is a reasonable choice. A quick way to sanity-check the effective masking rate is sketched below.
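A minimal sketch of that check, assuming tokenized_datasets_train contains only tokenizer output columns (input_ids, attention_mask) as in your setup above:

import torch

# Collate a few tokenized examples and measure how many positions the
# collator selected for the MLM objective. DataCollatorForLanguageModeling
# sets labels to -100 everywhere except at the selected positions.
sample = [tokenized_datasets_train[i] for i in range(8)]
batch = data_collator(sample)
mask_rate = (batch["labels"] != -100).float().mean().item()
print(f"Observed masking rate: {mask_rate:.3f}")  # roughly 0.15, a bit lower once padding/special tokens are excluded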

2. Training Arguments:

  • learning_rate=2e-5: This is a typical learning rate for fine-tuning transformer models and a good starting point. If training is unstable, lower it; if the loss plateaus early, you can try a slightly higher value or a different learning-rate schedule.

  • per_device_train_batch_size=24 and per_device_eval_batch_size=24: These batch sizes are reasonable, but if you run into memory issues, consider lowering them. Since you’re using fp16=True (mixed precision training), you can often get away with slightly larger batch sizes.

  • num_train_epochs=3: This is a standard number of epochs for fine-tuning, but the optimal number depends on your dataset and training progress. You can monitor the model’s performance and adjust accordingly. If you see overfitting, consider using early stopping or reducing epochs.

  • weight_decay=0.01: This is a standard weight decay value, which helps prevent overfitting. You could experiment with lower or higher values depending on your results.

  • max_grad_norm=1.0: Gradient clipping is a good practice, and a value of 1.0 is typical. If you experience exploding gradients, you can try lowering this value.

  • fp16=True: Using mixed-precision training (fp16) can speed up training and reduce memory usage, which is excellent. Ensure that your GPU supports FP16 (e.g., Volta or Turing architecture).

  • gradient_accumulation_steps=4 and eval_accumulation_steps=24: Gradient accumulation gives you an effective training batch size of 24 × 4 = 96 per device without the memory cost of a larger per-device batch, and eval_accumulation_steps periodically moves accumulated predictions to the CPU so evaluation doesn't exhaust GPU memory. If you hit memory issues, lower per_device_train_batch_size and raise gradient_accumulation_steps to keep the same effective batch size; the accumulation itself adds very little memory overhead.

  • warmup_steps=1000: Warmup is good practice to avoid large, destabilizing updates at the start of training. You may want to adjust it to your dataset size; typically, about 10% of the total number of optimizer steps is a good rule of thumb (see the quick calculation after this list).

  • save_steps=500 and eval_steps=500: These values seem reasonable for saving and evaluating the model frequently. If training is long, you can reduce these to save checkpoints more often, but too frequent saving can slow down training. You may want to adjust based on the training duration.

  • overwrite_output_dir=True: This ensures that the output directory is overwritten with each run. Make sure this is what you want, as it will discard previous results unless you specify different output directories.

  • optim="adamw_torch": AdamW is a good optimizer for transformer models. You could also try experimenting with the learning rate scheduler and possibly adjusting the beta1 and beta2 hyperparameters if you face issues with convergence.

3. Trainer Class:

  • MyTrainer: Clearing the CUDA cache after each training step with torch.cuda.empty_cache() can help if memory usage is a concern, especially for large models, but note that it adds overhead on every step and is usually only needed if you see fragmentation-related out-of-memory errors. Monitor GPU memory usage to check whether it actually helps; a minimal way to do that is sketched after this list. If the memory footprint still grows over time, revisit this approach.

  • load_best_model_at_end=True: This is a good strategy to ensure that the best model (by evaluation loss, the default metric) is retained after training. It requires the save and evaluation strategies to match and save_steps to be a multiple of eval_steps; your settings (steps, 500/500) satisfy that.
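A minimal sketch of such monitoring, assuming a single CUDA device (the class name MemoryLoggingTrainer is just illustrative; the idea is the same as your MyTrainer override):

import torch
from transformers import Trainer

class MemoryLoggingTrainer(Trainer):
    def training_step(self, model, inputs, *args, **kwargs):
        loss = super().training_step(model, inputs, *args, **kwargs)
        # Print the peak GPU allocation roughly every 100 optimizer steps.
        if self.state.global_step % 100 == 0:
            peak_gb = torch.cuda.max_memory_allocated() / 1024**3
            print(f"step {self.state.global_step}: peak GPU memory {peak_gb:.2f} GB")
        return loss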

4. Miscellaneous:

  • Dataset Size: With 200,000 rows you have a decent amount of data for fine-tuning. Make sure the dataset is tokenized correctly, and if sequence lengths vary a lot, tokenize without fixed-length padding and let the data collator pad each batch dynamically (DataCollatorForLanguageModeling does this), which reduces padding overhead.

  • Evaluation Metrics: For MLM you'll want to monitor perplexity or masked-token accuracy to track progress during training; a small perplexity example follows this list. If you use custom evaluation metrics, make sure they align with the MLM objective.
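Perplexity can be read directly from the evaluation loss returned by the trainer defined above (trainer.evaluate() reports the masked-token cross-entropy under the key eval_loss):

import math

eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Eval loss: {eval_results['eval_loss']:.4f} | Perplexity: {perplexity:.2f}")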

Final Recommendations:

  • Early Stopping: You might want to add early stopping to prevent overfitting and save training time, monitoring validation loss or another metric of your choice (see the sketch below).
  • Learning Rate Scheduling: The Trainer already applies a linear schedule with warmup by default; you can switch to a cosine schedule by setting lr_scheduler_type="cosine" in TrainingArguments (or build a scheduler yourself with get_scheduler if you manage the optimizer manually), which can help with convergence.
  • Data Shuffling: Make sure your training data is well shuffled so that ordering biases don't affect the model; the Trainer shuffles the training set each epoch by default, so this mainly matters if your data is pre-sorted or grouped upstream.
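A concrete sketch combining these suggestions (the values are illustrative; EarlyStoppingCallback requires load_best_model_at_end=True and metric_for_best_model to be set, which fits your current setup):

from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",   # stop on validation loss
    greater_is_better=False,
    lr_scheduler_type="cosine",          # default is "linear"; both decay after warmup
    warmup_ratio=0.1,                    # 10% of total steps instead of a fixed warmup_steps
    # ... keep the rest of your arguments as above ...
)

trainer = MyTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_test,
    data_collator=data_collator,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evaluations without improvement
)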

Overall, your training arguments look solid. Adjustments can be made based on experimentation and monitoring model performance over time. Let me know if you need further help!
