Retraining a model from scratch in a loop

Hello all,

I am currently exploring the influence of some architectures and hyperparameters on my specific task. To that end, I created a loop that trains the same model several times (with the same set of hyperparameters but reinitialized weights), so that I can check whether differences in model performance are statistically significant compared to other configurations. However, even though I set force_download = True when downloading the pretrained model (with the from_pretrained method), my experiments return the same results on every iteration after the first one (see image below).

So, do you have any insight into what I am doing wrong? How can I create a loop that trains a new model from scratch each time?

Thank you,

J.

Could you be loading from a checkpoint somewhere, perhaps?

How would force_download help? Please post the relevant part of your code, i.e. where you loop and load the model.

Hello Chris, thanks for your answer. As far as I know, I am not loading any checkpoint. The only place where I load the model is with the following command:

    model = CamembertForSequenceClassification.from_pretrained(
        "camembert-base",             # 12-layer CamemBERT model
        num_labels = Num_classes,     # Number of predicted classes
        output_attentions = False,    # Whether the model returns attention weights.
        output_hidden_states = False, # Whether the model returns all hidden-states.
        force_download = True
    )

As you can see, I am using “force_download = True”, and I’m not calling “from_pretrained” anywhere else.
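
For reference, “force_download = True” only makes the library re-download the same checkpoint files from the model hub, so the pretrained weights it restores are identical on every call; only the untrained classification head on top is randomly initialized. A minimal sketch of how one could verify this (the names m1/m2 are purely illustrative):

    import torch
    from transformers import CamembertForSequenceClassification

    # Sketch: load the checkpoint twice, once with force_download, and compare one
    # of the pretrained encoder tensors. The downloaded weights never change.
    m1 = CamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=2)
    m2 = CamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=2,
                                                            force_download=True)

    print(torch.equal(m1.base_model.embeddings.word_embeddings.weight,
                      m2.base_model.embeddings.word_embeddings.weight))  # True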

Also post the loop that you are using.

Hello Bram,

My loop is pretty messy right now… but I’ll post it all anyway. To help you understand it: I take r readings for each of n samples. The idea is to average the readings within each sample, so that the resulting data (hopefully) follow a normal distribution (a sketch of that averaging step is included after the code). I hope this helps and is not too overwhelming; otherwise, feel free to ask any questions!

from time import gmtime, strftime

base_path = '/gdrive/My Drive/Colab Notebooks/Recherche/Journals/Third paper/Models/Spie/Arret Prod/Plain/Results'

Number_of_samples = 30

Number_of_readings = 5

for sample in range(Number_of_samples):

  print('Training for sample ', sample)

  

  sample_list = []

  reading_list = []

  accuracy_list = []

  sensitivity_list = []

  specificity_list = []

  F1_list = []

  MCC_list = []

  training_time_list = []

  filename = strftime("%Y-%m-%d %H:%M:%S", gmtime())

  global_path = base_path + '/' + filename + '.csv'

  results_dict = {'Sample':[0]*5, 'Reading':[0]*5, 'Accuracy':[0]*5, 'Sensitivity':[0]*5, 

                  'Specificity':[0]*5, 'F1 Score':[0]*5, 'MCC':[0]*5}

  results_summary = pd.DataFrame(data = results_dict)

  results_summary = results_summary.fillna(0) # with 0s rather than NaNs

  for reading in range(Number_of_readings):

    print('training for reading ', reading + 1)

    

    from transformers import CamembertForSequenceClassification, AdamW, CamembertConfig

    # Load CamembertForSequenceClassification, CamemBERT Model transformer with a 

    # sequence classification/regression 

    # head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.

    print('\n Loading the model at the beginning \n')

    model = CamembertForSequenceClassification.from_pretrained(

        "camembert-base", #12-layer CamemBERT model

        num_labels = Num_classes, # Number of predicted classes

        output_attentions = False, # Whether the model returns attention weights.

        output_hidden_states = False, # Whether the model returns all hidden-states.

        force_download = True # Re-download the checkpoint files each time.

    ) 

      

    # Tell pytorch to run this model on the GPU.

    model.cuda()

    # Note: AdamW is a class from the huggingface library (as opposed to pytorch) 

    # I believe the 'W' stands for 'Weight Decay fix'.

    optimizer = AdamW(model.parameters(),

                      lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5

                      eps = 1e-8 # args.adam_epsilon  - default is 1e-8.

                    )

    from transformers import get_linear_schedule_with_warmup

    # Number of training epochs. The BERT authors recommend between 2 and 4. 

    # We run 3 epochs here.

    epochs = 3

    # Total number of training steps is [number of batches] x [number of epochs]. 

    # (Note that this is not the same as the number of training samples).

    total_steps = len(train_dataloader) * epochs

    # Create the learning rate scheduler.

    scheduler = get_linear_schedule_with_warmup(optimizer, 

                                                num_warmup_steps = 0, # Default value in run_glue.py

                                                num_training_steps = total_steps)

    import numpy as np

    # Function to calculate the accuracy of our predictions vs labels

    def flat_accuracy(preds, labels):

        pred_flat = np.argmax(preds, axis=1).flatten()

        labels_flat = labels.flatten()

        return np.sum(pred_flat == labels_flat) / len(labels_flat)

    import time

    import datetime

    def format_time(elapsed):

        '''

        Takes a time in seconds and returns a string hh:mm:ss

        '''

        # Round to the nearest second.

        elapsed_rounded = int(round((elapsed)))

        

        # Format as hh:mm:ss

        return str(datetime.timedelta(seconds=elapsed_rounded))

    import random

    import numpy as np

    # This training code is based on the `run_glue.py` script here:

    # https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

    # Set the seed value all over the place to make this reproducible.

    seed_val = 42

    random.seed(seed_val)

    np.random.seed(seed_val)

    torch.manual_seed(seed_val)

    torch.cuda.manual_seed_all(seed_val)

    # We'll store a number of quantities such as training and validation loss, 

    # validation accuracy, and timings.

    training_stats = []

    # Measure the total training time for the whole run.

    total_t0 = time.time()

    # For each epoch...

    for epoch_i in range(0, epochs):

        

        # ========================================

        #               Training

        # ========================================

        

        # Perform one full pass over the training set.

        print("")

        print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))

        print('Training...')

        # Measure how long the training epoch takes.

        t0 = time.time()

        # Reset the total loss for this epoch.

        total_train_loss = 0

        # Put the model into training mode. Don't be misled--the call to 

        # `train` just changes the *mode*, it doesn't *perform* the training.

        # `dropout` and `batchnorm` layers behave differently during training

        # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)

        model.train()

        # For each batch of training data...

        for step, batch in enumerate(train_dataloader):

            # Progress update every 40 batches.

            if step % 40 == 0 and not step == 0:

                # Calculate elapsed time in minutes.

                elapsed = format_time(time.time() - t0)

                

                # Report progress.

                print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

            # Unpack this training batch from our dataloader. 

            #

            # As we unpack the batch, we'll also copy each tensor to the GPU using the 

            # `to` method.

            #

            # `batch` contains three pytorch tensors:

            #   [0]: input ids 

            #   [1]: attention masks

            #   [2]: labels 

            b_input_ids = batch[0].to(device)

            b_input_mask = batch[1].to(device)

            b_labels = batch[2].to(device)

            # Always clear any previously calculated gradients before performing a

            # backward pass. PyTorch doesn't do this automatically because 

            # accumulating the gradients is "convenient while training RNNs". 

            # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)

            model.zero_grad()        

            # Perform a forward pass (evaluate the model on this training batch).

            # The documentation for this `model` function is here: 

            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification

            # It returns different numbers of parameters depending on what arguments

            # are given and what flags are set. For our usage here, it returns

            # the loss (because we provided labels) and the "logits"--the model

            # outputs prior to activation.

            loss, logits = model(b_input_ids, 

                                token_type_ids=None, 

                                attention_mask=b_input_mask, 

                                labels=b_labels)

            # Accumulate the training loss over all of the batches so that we can

            # calculate the average loss at the end. `loss` is a Tensor containing a

            # single value; the `.item()` function just returns the Python value 

            # from the tensor.

            total_train_loss += loss.item()

            # Perform a backward pass to calculate the gradients.

            loss.backward()

            # Clip the norm of the gradients to 1.0.

            # This is to help prevent the "exploding gradients" problem.

            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and take a step using the computed gradient.

            # The optimizer dictates the "update rule"--how the parameters are

            # modified based on their gradients, the learning rate, etc.

            optimizer.step()

            # Update the learning rate.

            scheduler.step()

        # Calculate the average loss over all of the batches.

        avg_train_loss = total_train_loss / len(train_dataloader)            

        

        # Measure how long this epoch took.

        training_time = format_time(time.time() - t0)

        print("")

        print("  Average training loss: {0:.2f}".format(avg_train_loss))

        print("  Training epoch took: {:}".format(training_time))

            

        # ========================================

        #               Validation

        # ========================================

        # After the completion of each training epoch, measure our performance on

        # our validation set.

        print("")

        print("Running Validation...")

        t0 = time.time()

        # Put the model in evaluation mode--the dropout layers behave differently

        # during evaluation.

        model.eval()

        # Tracking variables 

        total_eval_accuracy = 0

        total_eval_loss = 0

        nb_eval_steps = 0

        # Evaluate data for one epoch

        for batch in validation_dataloader:

            

            # Unpack this validation batch from our dataloader. 

            #

            # As we unpack the batch, we'll also copy each tensor to the GPU using 

            # the `to` method.

            #

            # `batch` contains three pytorch tensors:

            #   [0]: input ids 

            #   [1]: attention masks

            #   [2]: labels 

            b_input_ids = batch[0].to(device)

            b_input_mask = batch[1].to(device)

            b_labels = batch[2].to(device)

            

            # Tell pytorch not to bother with constructing the compute graph during

            # the forward pass, since this is only needed for backprop (training).

            with torch.no_grad():        

                # Forward pass, calculate logit predictions.

                # token_type_ids is the same as the "segment ids", which 

                # differentiates sentence 1 and 2 in 2-sentence tasks.

                # The documentation for this `model` function is here: 

                # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification

                # Get the "logits" output by the model. The "logits" are the output

                # values prior to applying an activation function like the softmax.

                (loss, logits) = model(b_input_ids, 

                                      token_type_ids=None, 

                                      attention_mask=b_input_mask,

                                      labels=b_labels)

                

            # Accumulate the validation loss.

            total_eval_loss += loss.item()

            # Move logits and labels to CPU

            logits = logits.detach().cpu().numpy()

            label_ids = b_labels.to('cpu').numpy()

            # Calculate the accuracy for this batch of test sentences, and

            # accumulate it over all batches.

            total_eval_accuracy += flat_accuracy(logits, label_ids)

            

        # Report the final accuracy for this validation run.

        avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)

        print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

        # Calculate the average loss over all of the batches.

        avg_val_loss = total_eval_loss / len(validation_dataloader)

        

        # Measure how long the validation run took.

        validation_time = format_time(time.time() - t0)

        

        print("  Validation Loss: {0:.2f}".format(avg_val_loss))

        print("  Validation took: {:}".format(validation_time))

        # Record all statistics from this epoch.

        training_stats.append(

            {

                'epoch': epoch_i + 1,

                'Training Loss': avg_train_loss,

                'Valid. Loss': avg_val_loss,

                'Valid. Accur.': avg_val_accuracy,

                'Training Time': training_time,

                'Validation Time': validation_time

            }

        )

    print("")

    print("Training complete!")

    print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))

    training_time_list.append(format_time(time.time()-total_t0))

    # Prediction on validation set

    print('Predicting labels for {:,} validation sentences...'.format(len(input_ids_val)))

    # Put model in evaluation mode

    model.eval()

    # Tracking variables 

    predictions , true_labels = [], []

    # Predict 

    for batch in validation_dataloader:

      # Add batch to GPU

      batch = tuple(t.to(device) for t in batch)

      

      # Unpack the inputs from our dataloader

      b_input_ids, b_input_mask, b_labels = batch

      

      # Telling the model not to compute or store gradients, saving memory and 

      # speeding up prediction

      with torch.no_grad():

          # Forward pass, calculate logit predictions

          outputs = model(b_input_ids, token_type_ids=None, 

                          attention_mask=b_input_mask)

      logits = outputs[0]

      # Move logits and labels to CPU

      logits = logits.detach().cpu().numpy()

      label_ids = b_labels.to('cpu').numpy()

      

      # Store predictions and true labels

      predictions.append(logits)

      true_labels.append(label_ids)

    print('    DONE.')

    # Getting the MCC and other metrics 

    # Combine the results across all batches. 

    flat_predictions = np.concatenate(predictions, axis=0)

    # For each sample, pick the label (0 or 1) with the higher score.

    flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

    # Combine the correct labels for each batch into a single list.

    flat_true_labels = np.concatenate(true_labels, axis=0)

    from sklearn.metrics import recall_score, f1_score, matthews_corrcoef

    good_preds = 0

    for i in range(len(flat_predictions)):

      if flat_predictions[i] == flat_true_labels[i]:

        good_preds += 1

    accuracy = good_preds/len(flat_predictions)

    print('Accuracy: ', accuracy)

    # 0 label is our positive label (Dominant disturbance)

    print('Sensitivity: ', recall_score(flat_true_labels, flat_predictions, pos_label = 0))

    # 1 label is our negative label (Recessive disturbance)

    print('Specificity: ', recall_score(flat_true_labels, flat_predictions, pos_label = 1))

    print('F1 score: ', f1_score(flat_true_labels, flat_predictions))

    print('Matthews Cor. Coef.: ', matthews_corrcoef(flat_true_labels, flat_predictions))

    print('\n')

    print('#####\n')

    

    sample_list.append(sample)

    reading_list.append(reading)

    accuracy_list.append(accuracy)

    sensitivity_list.append(recall_score(flat_true_labels, flat_predictions, pos_label = 0))

    specificity_list.append(recall_score(flat_true_labels, flat_predictions, pos_label = 1))

    F1_list.append(f1_score(flat_true_labels, flat_predictions))

    MCC_list.append(matthews_corrcoef(flat_true_labels, flat_predictions))

    

  results_summary['Sample'] = sample_list

  results_summary['Reading'] = reading_list

  results_summary['Accuracy'] = accuracy_list

  results_summary['Sensitivity'] = sensitivity_list

  results_summary['Specificity'] = specificity_list

  results_summary['F1 Score'] = F1_list

  results_summary['MCC'] = MCC_list

  results_summary['Training time'] = training_time_list

  results_summary.to_csv(global_path, sep=',',index=False , encoding='latin-1')

  print(filename, ' saved!')
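
For completeness, a sketch (not part of the original code) of the averaging step mentioned above: combine the per-sample CSV files written by the loop and average the readings within each sample, using the column names from the loop.

    import glob
    import pandas as pd

    # Sketch: gather every results CSV written above and average the readings taken
    # for each sample, so that each sample contributes one averaged row.
    all_runs = pd.concat(
        (pd.read_csv(path) for path in glob.glob(base_path + '/*.csv')),
        ignore_index=True,
    )
    metric_cols = ['Accuracy', 'Sensitivity', 'Specificity', 'F1 Score', 'MCC']
    per_sample_means = all_runs.groupby('Sample')[metric_cols].mean()
    print(per_sample_means)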

So you are running the same model with a fixed seed. Why would you expect different output? The whole point of setting the seed is to make the process deterministic.
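
A minimal sketch of one way to reconcile this with the statistical goal (the seeding scheme below is an illustrative assumption, not from the original code): keep each run reproducible, but derive a different seed for every sample/reading instead of reusing 42, so that each run gets its own classifier-head initialization, dropout pattern and (typically) data shuffling.

    import random
    import numpy as np
    import torch

    for sample in range(Number_of_samples):
        for reading in range(Number_of_readings):
            # Hypothetical scheme: a distinct, reproducible seed per (sample, reading).
            seed_val = 1000 * sample + reading
            random.seed(seed_val)
            np.random.seed(seed_val)
            torch.manual_seed(seed_val)
            torch.cuda.manual_seed_all(seed_val)
            # ... load the model with from_pretrained and train as in the posted loop;
            # the pretrained encoder weights are the same every time, only the
            # seed-dependent parts of the run change ...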

Hello Bram,

Thank you for having a look at my code. I agree with you: setting the seed should ensure the reproducibility of the experiments. However, if I restart my Google Colab notebook and rerun everything, I get different results. Could it be that CUDA sessions have some inherent randomness? Maybe one of these two chunks of code is adding it:

This:

import tensorflow as tf

# Get the GPU device name.

device_name = tf.test.gpu_device_name()

# The device name should look like the following:

if device_name == '/device:GPU:0':

    print('Found GPU at: {}'.format(device_name))

else:

    raise SystemError('GPU device not found')

Or this:

import torch

# If there's a GPU available...

if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    

    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...

else:

    print('No GPU available, using the CPU instead.')

    device = torch.device("cpu")

Could it be that the virtual machines get assigned GPUs with different characteristics? That might explain the differences in results when I restart everything from scratch. I will run some experiments with this part of the code inside the loop and report the results here.

Thanks for your time and attention,

J.

[Update] I just tested this by running everything again: I do get different results when I restart and rerun the whole Google Colab notebook.

I’m confused, though. Why are you using both TF and PyTorch? It’s best to stick with one framework for a single experiment to avoid any conflicts. Generally speaking, when using torch, the following should make your code reproducible:

import os
import random
import numpy as np
import torch

seed = 42
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
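
Even with these flags, results are generally only reproducible on the same GPU model and the same library versions, and a few CUDA kernels remain nondeterministic. If you want PyTorch to flag those explicitly, newer versions (1.8+) provide an extra switch (optional, a sketch):

    import torch

    # Optional, PyTorch >= 1.8: raise an error whenever a known nondeterministic
    # CUDA operation is used. Some ops additionally require the
    # CUBLAS_WORKSPACE_CONFIG environment variable to be set beforehand; see the
    # PyTorch reproducibility notes.
    torch.use_deterministic_algorithms(True)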

Hello Bram. Thanks for the information. I was just using tf to get information about the GPU, for no particular reason.

I set the seed as you suggested. However, I still get variable results when resetting and rerunning the Google Colab notebook. Could it be something in the internal mechanics of Google Colab? I don’t really see any other source of randomness.

Thank you for your help,

J.

How do you reinitialize the weights?
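
For reference, one possible answer (a sketch, not from the thread): if the goal is a CamemBERT with freshly initialized, untrained weights, the model can be built from its configuration instead of with from_pretrained, which always restores the same pretrained encoder weights regardless of force_download. Num_classes is the variable used earlier in the thread.

    from transformers import CamembertConfig, CamembertForSequenceClassification

    # Sketch: build the architecture from its config -> all weights are randomly
    # initialized; only the configuration file is downloaded.
    config = CamembertConfig.from_pretrained("camembert-base", num_labels=Num_classes)
    model = CamembertForSequenceClassification(config)

    # By contrast, from_pretrained("camembert-base", ...) restores the identical
    # pretrained encoder weights on every call, and only the classification head
    # on top is (seed-dependently) randomly initialized.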