Questions about my first code for fine-tuning a BERT model for text classification

Dear All,

I am fine-tuning a BERT model for a binary text-classification (sequence classification) task in Arabic. For that purpose I am using the aubmindlab/bert-base-arabertv02 checkpoint.

The code is written to support multiple checkpoints via the AutoTokenizer and AutoModelForSequenceClassification classes, as I intend to experiment with more than one checkpoint. The dataset is imbalanced, so I am using a custom Trainer with a class-weighted loss, as suggested by the Hugging Face docs and tutorials.

I need to understand a few things to make sure I am doing this correctly.

  • In the training args I provided my Hugging Face model repo to set up the early-stopping callback, so every time I run an experiment the model is loaded from that repo instead of being downloaded from the provided checkpoint (see the training args below). Does this bias the results of the model after training?

  • As mentioned earlier, the task is binary classification with classes [0, 1]. In the custom Trainer I use CrossEntropyLoss to compute the loss. Am I using the right loss function?

  • I have a fairly large dataset with 1,093,402 records. Because it is imbalanced, I split it into training and validation sets with stratified k-fold: 988,789 records for training and 104,613 for validation. I also have a small, separate hold-out test set (15,206 records) from another source that I use for comparing results. Does the small size of the test set affect the measured performance, since I am not getting results as good as on the validation set? Also, should I use trainer.evaluate or trainer.predict to evaluate the final performance of the fine-tuned model? (See the sketch after the Trainer instance at the end of this post.)

  • After using the custom Trainer with weighted loss, should I treat the dataset as if it were balanced when choosing metrics (i.e. accuracy), or should I stick with the macro F1 score because it is imbalanced? (See the metrics sketch right after this list.)

  • Is the code implementation below enough to overcome overfitting?
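
My compute_metrics function is not shown in this post; the sketch below (using scikit-learn, with illustrative names) only shows the kind of function I mean, reporting both accuracy and macro F1 so the two metrics in the question above can be compared:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair produced by the Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),  # macro F1, since the data is imbalanced
    }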

My Dataset Description:

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'WikidataArabicDescrption', 'WikipediaArabicDescrption', 'label'],
        num_rows: 988789
    })
    valid: Dataset({
        features: ['Unnamed: 0', 'WikidataArabicDescrption', 'WikipediaArabicDescrption', 'label'],
        num_rows: 104613
    })
    test: Dataset({
        features: ['Unnamed: 0', 'WikidataArabicDescrption', 'WikipediaArabicDescrption', 'label'],
        num_rows: 15206
    })
})

My Training Args:

# Defining the TrainingArguments() arguments

logging_steps = len(my_dataset['train']) // hyperparameter_defaults['batch_size']

training_args = TrainingArguments(
    "aomar85/WikiBERT",
    num_train_epochs=hyperparameter_defaults['num_train_epochs'],
    evaluation_strategy=hyperparameter_defaults['evaluation_strategy'],
    # evaluation_strategy (str or IntervalStrategy, optional, defaults to "no"):
    #   The evaluation strategy to adopt during training. Possible values are:
    #     - "no": no evaluation is done during training.
    #     - "steps": evaluation is done (and logged) every eval_steps.
    #     - "epoch": evaluation is done at the end of each epoch.
    eval_steps=hyperparameter_defaults['eval_steps'],  # evaluation happens every eval_steps steps
    save_total_limit=5,  # only the last 5 checkpoints are kept; older ones are deleted
    logging_strategy=hyperparameter_defaults['logging_strategy'],
    # logging_steps (default 500): number of update steps between two logs if
    # logging_strategy="steps".
    #logging_steps=logging_steps,
    # save_strategy (default "steps"): the checkpoint save strategy to adopt during training.
    #   Possible values are:
    #     - "no": no save is done during training.
    #     - "epoch": save is done at the end of each epoch.
    #     - "steps": save is done every save_steps (default 500).
    save_strategy=hyperparameter_defaults['save_strategy'],
    # save_steps (default 500): number of update steps between two checkpoint
    # saves if save_strategy="steps".
    save_steps=hyperparameter_defaults['save_steps'],
    #run_name=run_name,
    disable_tqdm=False,  # added by me based on the Trainer docs
    seed=hyperparameter_defaults['seed'],  # added by me based on the Trainer docs
    learning_rate=hyperparameter_defaults['learning_rate'],
    # learning_rate (default 5e-5): the initial learning rate for the AdamW optimizer
    # (Adam with the weight-decay fix introduced in "Decoupled Weight Decay Regularization").
    #lr_scheduler_type='cosine',  # added by me based on the Trainer docs
    per_device_train_batch_size=hyperparameter_defaults['batch_size'],
    per_device_eval_batch_size=hyperparameter_defaults['batch_size'],
    weight_decay=hyperparameter_defaults['weight_decay'],
    # weight_decay (float, optional, defaults to 0): the weight decay to apply (if not zero)
    # to all layers except bias and LayerNorm weights in the AdamW optimizer.
    fp16=True,
    push_to_hub=True,
    metric_for_best_model=hyperparameter_defaults['metric_for_best_model'],  # e.g. 'f1' or 'eval_loss'
    # metric_for_best_model: used in conjunction with load_best_model_at_end to specify the
    # metric for comparing two different models. Must be the name of a metric returned by
    # the evaluation, with or without the prefix "eval_". If you set this value,
    # greater_is_better defaults to True; set it to False if your metric is better when lower.
    #greater_is_better=False,
    load_best_model_at_end=True,
    # load_best_model_at_end (bool, optional, defaults to False): whether or not to load the
    # best model found during training at the end of training.
    report_to="all",
    # report_to: the list of integrations to report results and logs to. Supported
    # platforms are "azure_ml", "comet_ml", "mlflow", "tensorboard" and "wandb".
    # Use "all" to report to all installed integrations, "none" for no integrations.
)
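
To make my first question concrete: the first positional argument above is the output_dir, and with push_to_hub=True that directory is tied to the aomar85/WikiBERT repo. Below is a minimal sketch of the alternative I am wondering about, assuming hub_model_id can be set separately from the output directory (the names are illustrative, not my current setup):

# Sketch only: a fresh local output_dir per experiment, with the repo set via hub_model_id,
# so training always starts from the original checkpoint and only pushes results to the repo.
training_args = TrainingArguments(
    output_dir="wikibert-experiment-1",   # illustrative local directory
    hub_model_id="aomar85/WikiBERT",      # hub repo to push to
    push_to_hub=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)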

The Custom Trainer:

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Forward pass: feed inputs to the model and extract the logits
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Extract the labels
        labels = inputs.get("labels")
        # Compute a custom loss (here: 2 labels with different weights)
        loss_func = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_func(logits, labels)
        return (loss, outputs) if return_outputs else loss

The class weights:

class_weights = (1 - (train_df['label'].value_counts().sort_index() / len(train_df['label']))).values
class_weights = torch.from_numpy(class_weights).float().to("cuda")
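
Just to make the weighting concrete, here is a tiny self-contained sketch of the same formula on made-up counts (950 vs. 50 examples, purely illustrative, not my real label distribution); the majority class gets the smaller weight:

import pandas as pd
import torch

# Hypothetical, illustrative label counts (not my real distribution)
train_df = pd.DataFrame({"label": [0] * 950 + [1] * 50})

counts = train_df["label"].value_counts().sort_index()        # 0 -> 950, 1 -> 50
class_weights = (1 - counts / len(train_df["label"])).values  # [0.05, 0.95]
class_weights = torch.from_numpy(class_weights).float()       # .to("cuda") on a GPU machine

print(class_weights)  # tensor([0.0500, 0.9500])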

My Trainer instance:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = WeightedLossTrainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
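
And this is roughly how I plan to score the separate hold-out test set after training, which is what my trainer.evaluate vs. trainer.predict question refers to (a sketch, assuming the test split is tokenized the same way as train/valid):

import numpy as np

# Option A: trainer.evaluate runs compute_metrics and returns a metrics dict
test_metrics = trainer.evaluate(eval_dataset=tokenized_datasets["test"], metric_key_prefix="test")
print(test_metrics)  # e.g. {'test_loss': ..., 'test_accuracy': ..., 'test_f1': ...}

# Option B: trainer.predict also returns the raw logits, so individual predictions can be inspected
predictions = trainer.predict(tokenized_datasets["test"], metric_key_prefix="test")
print(predictions.metrics)                                   # same metrics as option A
pred_labels = np.argmax(predictions.predictions, axis=-1)    # per-example predicted classes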

Thanks in advance