Dear All,
I am fine-tuning a BERT model for a binary text (sequence) classification task in Arabic. For that purpose I am using the aubmindlab/bert-base-arabertv02 checkpoint.
The code is written to support multiple checkpoints via the AutoTokenizer and AutoModelForSequenceClassification classes, as I intend to experiment with more than one checkpoint. The dataset I am using is imbalanced, so I am using a custom Trainer with a class-weighted loss, as suggested by the Hugging Face docs and tutorials.
I need to understand a couple of things to make sure I am doing this correctly.
- In the training args I provided my Hugging Face model repo (as the output directory) and set up an early-stopping callback, so every time I run an experiment the model is loaded from that repo instead of being downloaded from the provided base checkpoint, please refer to the image. I am wondering: does this bias the results of the model after training? (See the sketch below.)
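To make the question concrete, this is the difference I am worried about (a sketch using only the checkpoint and repo names mentioned above):

from transformers import AutoModelForSequenceClassification

# What I intend: start every experiment from the base checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv02", num_labels=2)

# What I am afraid happens: weights come from my already fine-tuned Hub repo
# model = AutoModelForSequenceClassification.from_pretrained(
#     "aomar85/WikiBERT", num_labels=2)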
- As mentioned earlier, the task is binary classification with classes [0, 1]. In the custom Trainer I use CrossEntropyLoss to compute the loss. Am I using the right loss function? (A toy example of what I mean is below.)
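For reference, this is the shape convention I am assuming for the loss (a toy sketch; the logits, labels and weights here are made up):

import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0], [0.5, 0.3]])  # (batch_size, num_labels=2)
labels = torch.tensor([0, 1])                     # class indices, not one-hot
weights = torch.tensor([0.1, 0.9])                # minority class weighted higher
loss = nn.CrossEntropyLoss(weight=weights)(logits, labels)
print(loss.item())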
- I have a fairly large dataset of 1,093,402 records. Because it is imbalanced, I split it into train and validation sets with Stratified K-Fold: 988,789 records for training and 104,613 for validating the model. Besides, I have a small, separate hold-out test dataset (15,206 records) from another source that I use for comparing results. Could the small size of the test dataset explain why I am not getting results as good as the validation ones? Also, should I use the trainer.evaluate or the trainer.predict method to evaluate the final performance of the fine-tuned model? (See the sketch below.)
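For concreteness, the trainer.predict variant on the hold-out set would look like this (a sketch; predict returns predictions, label_ids and the metrics computed by compute_metrics, prefixed with "test_"):

# Evaluate the fine-tuned model on the hold-out test set
pred_output = trainer.predict(tokenized_datasets["test"])
print(pred_output.metrics)                       # e.g. test_loss, test_f1, ...
preds = pred_output.predictions.argmax(axis=-1)  # per-example class predictions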
- After using the custom Trainer with the weighted loss, should I choose metrics as if the dataset were balanced (e.g. accuracy), or should I stick with the macro F1 score because it is imbalanced? (My metrics function is sketched below.)
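For reference, my compute_metrics is along these lines (a minimal sketch assuming scikit-learn, not my exact code; I report both metrics so I can compare them):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred            # the Trainer passes (predictions, label_ids)
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),  # macro F1 for imbalance
    }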
- Is the code implementation below enough to overcome overfitting?
My Dataset Description:
DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'WikidataArabicDescrption', 'WikipediaArabicDescrption', 'label'],
        num_rows: 988789
    })
    valid: Dataset({
        features: ['Unnamed: 0', 'WikidataArabicDescrption', 'WikipediaArabicDescrption', 'label'],
        num_rows: 104613
    })
    test: Dataset({
        features: ['Unnamed: 0', 'WikidataArabicDescrption', 'WikipediaArabicDescrption', 'label'],
        num_rows: 15206
    })
})
My Training Args:
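For completeness, the hyperparameter_defaults dict the arguments below read from looks roughly like this (illustrative values only; the exact numbers vary per experiment):

hyperparameter_defaults = {
    'num_train_epochs': 3,
    'batch_size': 32,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'evaluation_strategy': 'steps',
    'eval_steps': 500,
    'logging_strategy': 'steps',
    'save_strategy': 'steps',
    'save_steps': 500,
    'metric_for_best_model': 'f1',
    'seed': 42,
}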
# Defining the TrainingArguments() arguments
from transformers import TrainingArguments

logging_steps = len(my_dataset['train']) // hyperparameter_defaults['batch_size']

training_args = TrainingArguments(
    "aomar85/WikiBERT",  # output_dir; also the Hub repo when push_to_hub=True
    num_train_epochs=hyperparameter_defaults['num_train_epochs'],
    # evaluation_strategy: "no" (never), "steps" (every eval_steps), or "epoch"
    evaluation_strategy=hyperparameter_defaults['evaluation_strategy'],
    eval_steps=hyperparameter_defaults['eval_steps'],  # evaluation happens every eval_steps
    save_total_limit=5,  # only the last 5 checkpoints are kept; older ones are deleted
    logging_strategy=hyperparameter_defaults['logging_strategy'],
    # logging_steps (default 500): steps between two logs if logging_strategy="steps"
    # logging_steps=logging_steps,
    # save_strategy: "no", "epoch", or "steps" (save every save_steps, default 500)
    save_strategy=hyperparameter_defaults['save_strategy'],
    save_steps=hyperparameter_defaults['save_steps'],
    # run_name=run_name,
    disable_tqdm=False,  # added by me based on the Trainer glosses
    seed=hyperparameter_defaults['seed'],  # added by me based on the Trainer glosses
    # learning_rate (default 5e-5): initial learning rate for the AdamW optimizer
    # (Adam with the weight-decay fix from "Decoupled Weight Decay Regularization")
    learning_rate=hyperparameter_defaults['learning_rate'],
    # lr_scheduler_type='cosine',  # added by me based on the Trainer glosses
    per_device_train_batch_size=hyperparameter_defaults['batch_size'],
    per_device_eval_batch_size=hyperparameter_defaults['batch_size'],
    # weight_decay (default 0): applied to all layers except biases and LayerNorm weights
    weight_decay=hyperparameter_defaults['weight_decay'],
    fp16=True,
    push_to_hub=True,
    # metric_for_best_model: used with load_best_model_at_end to compare checkpoints;
    # must be the name of a metric returned by evaluation, with or without the "eval_" prefix.
    # Setting it makes greater_is_better default to True; set that to False if lower is better.
    metric_for_best_model=hyperparameter_defaults['metric_for_best_model'],  # e.g. 'f1' or 'eval_loss'
    # greater_is_better=False,
    load_best_model_at_end=True,  # reload the best checkpoint found during training at the end
    # report_to: integrations to log to ("azure_ml", "comet_ml", "mlflow",
    # "tensorboard", "wandb"); "all" reports to everything installed, "none" disables logging
    report_to="all",
)
The Custom Trainer:
import torch.nn as nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Forward pass: feed the inputs to the model and extract the logits
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Extract the labels
        labels = inputs.get("labels")
        # Compute the custom loss (two labels with different weights)
        loss_func = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_func(logits, labels)
        return (loss, outputs) if return_outputs else loss
The class weights:
class_weights = (1 - (train_df['label'].value_counts().sort_index() / len(train_df['label']))).values
class_weights = torch.from_numpy(class_weights).float().to("cuda")
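As a sanity check on the weight formula, here is a toy example with a made-up 90/10 split (label 1 being the minority):

import pandas as pd
import torch

toy = pd.Series([0] * 90 + [1] * 10, name='label')
w = (1 - toy.value_counts().sort_index() / len(toy)).values
print(w)  # [0.1 0.9] -> the minority class gets the larger weight
class_weights = torch.from_numpy(w).float()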
My Trainer instance:
from transformers import AutoModelForSequenceClassification, EarlyStoppingCallback

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = WeightedLossTrainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
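The run itself is then (a sketch; with load_best_model_at_end=True the best checkpoint is reloaded before the final evaluation):

trainer.train()
metrics = trainer.evaluate()  # runs on eval_dataset, i.e. the validation split
print(metrics)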
Thanks in advance