Hello all,
I ran a hyperparameter search using Optuna and got a model with 83% accuracy. When I then retrain using the same hyperparameters (including the seed), I cannot reproduce that result. These are my training arguments and Optuna search:
# Define the training arguments
training_args = TrainingArguments(
output_dir='./results', # output directory
seed = 0,
num_train_epochs=3, # total number of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=16, # batch size for evaluation
warmup_steps=22, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
learning_rate=5e-5, # initial learning rate for AdamW optimizer.
load_best_model_at_end=False, # load the best model when finished training (default metric is loss)
do_train=True, # Perform training
do_eval=True, # Perform evaluation
logging_dir='./logs', # directory for storing logs
logging_steps=10,
gradient_accumulation_steps=2, # number of steps to accumulate gradients over before each optimizer update
fp16=True, # Use mixed precision
fp16_opt_level="O2", # Apex mixed precision mode (letter O, not zero)
evaluation_strategy="epoch", # evaluate at the end of each epoch
save_strategy = 'no', # checkpoint save strategy; I don't want to save checkpoints, which is probably what took up disk space during the HP search
#save_total_limit = 1, # Trying this to stop Optuna from saving
)
trainer = Trainer(
model_init=model_init,
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=val_dataset, # evaluation dataset
compute_metrics=compute_metrics,
#callbacks=[EarlyStoppingCallback(3, 0.0)] # early stopping if results don't improve after 3 epochs
)
best_run = trainer.hyperparameter_search(direction="maximize",
hp_space=my_hp_space,
compute_objective=my_objective, # can't get this working; for now, work with loss
n_trials=50,
pruner=optuna.pruners.NopPruner(),
sampler=optuna.samplers.GridSampler(search_space),
study_name=name,
storage="sqlite:////content/drive/MyDrive/{}.db".format(name), #change this to a local directory if you want to save to disk
load_if_exists=True # True resumes an existing study with the same name; set to False to start a fresh search
)
best_run
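For context, this is roughly how I copy the best trial's values back before retraining. It is only a minimal sketch: `StandInArgs` is a hypothetical stand-in for `TrainingArguments` so the snippet runs on its own, `apply_hyperparameters` is a helper name I made up, and the example values are invented rather than taken from my search. In the real run the dictionary would be `best_run.hyperparameters`.

```python
import copy
from dataclasses import dataclass


@dataclass
class StandInArgs:
    """Hypothetical stand-in for TrainingArguments (assumption, not the real class)."""
    learning_rate: float = 5e-5
    num_train_epochs: int = 3
    seed: int = 0


def apply_hyperparameters(args, hyperparameters):
    """Return a copy of `args` with each searched hyperparameter overwritten."""
    new_args = copy.deepcopy(args)
    for name, value in hyperparameters.items():
        # Fail loudly on a misspelled key instead of silently adding an attribute.
        if not hasattr(new_args, name):
            raise AttributeError(f"unknown training argument: {name}")
        setattr(new_args, name, value)
    return new_args


# Invented values standing in for best_run.hyperparameters.
example_hyperparameters = {"learning_rate": 3e-5, "num_train_epochs": 4}
retrain_args = apply_hyperparameters(StandInArgs(), example_hyperparameters)
print(retrain_args)
```

Anything that was not part of the search space (here `seed`) keeps the value from the original arguments, so it has to match what the trial used.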
I have now also fixed the seeds for numpy and torch:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Could it be that the classification head, which is reinitialised every time I retrain, gets different random weights each time, resulting in different results?
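To test that idea, this is a minimal sketch of the seeding I could run before the model is created. `seed_everything` is a helper name I made up, and the numpy and torch imports are guarded so the snippet still runs without them. As far as I understand, `transformers.set_seed` covers the same three generators, and `Trainer` re-seeds from `TrainingArguments.seed` (0 above) before calling `model_init`, so that value rather than `RANDOM_SEED = 42` may be the one that decides how the head is initialised.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the Python, numpy and torch generators (the last two only if installed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
        # Ask cuDNN for deterministic kernels; this can slow training down.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


# Seeding twice with the same value should give the same first draw.
seed_everything(0)
first_draw = random.random()
seed_everything(0)
second_draw = random.random()
print(first_draw == second_draw)
```

Even with all of this fixed, I assume some GPU operations and fp16 arithmetic can still introduce small run-to-run differences, so an exact match may not be guaranteed.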