I am working on a multi-objective problem where I compute three losses and then sum them up. For each loss, I want a learnable coefficient (alpha, beta, and gamma, respectively) that will be optimized.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)

for batch in dl:
    optimizer.zero_grad()
    result = model(batch)
    loss1 = loss_fn_1(result)
    loss2 = loss_fn_2(result)
    loss3 = loss_fn_3(result)
    # How to optimize alpha, beta, and gamma?
    loss = alpha * loss1 + beta * loss2 + gamma * loss3
    loss.backward()
    optimizer.step()
Specific questions:
1. Should I even have coefficients alpha, beta, and gamma? The optimizer will minimize, so they'll all go to 0.0, right?
2. If having those coefficients is a good idea, how can I prevent them from going to 0.0? Someone told me to use regularization, but what does that mean in this case?
3. How do I declare alpha, beta, and gamma to be learnable by AdamW? (Roughly what I have in mind is sketched below.)
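Here is the kind of thing I am imagining for Q3, though I am not sure it is right (the initial values of 1.0 and the separate learning rate for the coefficient group are just guesses):

import torch
from torch.optim import AdamW

# Declare the coefficients as learnable scalar parameters.
alpha = torch.nn.Parameter(torch.tensor(1.0))
beta = torch.nn.Parameter(torch.tensor(1.0))
gamma = torch.nn.Parameter(torch.tensor(1.0))

# Hand them to AdamW together with the model's parameters,
# here as a separate parameter group with its own learning rate.
optimizer = AdamW(
    [
        {"params": model.parameters(), "lr": 2e-5},
        {"params": [alpha, beta, gamma], "lr": 1e-3},
    ],
    eps=1e-8,
)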
1. Theoretically, you have to impose a constraint such as alpha + beta + gamma = 1; otherwise, as you suspect, the optimizer will simply drive the coefficients toward zero, since that directly minimizes the loss.
2. To turn this into an unconstrained optimization, you apply a Lagrange multiplier to the constraint equation, and that is the "regularization" formula your friend was talking about, e.g. you put
lambda1*alpha, lambda2*beta and lambda3*gamma
into the loss function. I believe this complicates the problem even more, since finding the optimal values of the lambdas is difficult even theoretically.
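Just to make that concrete, a soft version of the constraint added to the loss could look like the sketch below. Note it uses a simple squared penalty on alpha + beta + gamma = 1 rather than true Lagrange multiplier terms, and the penalty weight is an arbitrary guess, so take it only as an illustration:

# Assumes alpha, beta, gamma are learnable parameters (see the sketch
# under the question) so the penalty has something to act on.
penalty_weight = 10.0  # arbitrary choice, would need tuning
constraint_penalty = penalty_weight * (alpha + beta + gamma - 1.0) ** 2

loss = alpha * loss1 + beta * loss2 + gamma * loss3 + constraint_penalty
loss.backward()
optimizer.step()

Whether this actually behaves well in practice is a separate question, for the reasons given above.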
2.5 Sorry, this does not answer your Q3, but I think the practical way is to treat alpha, beta, and gamma as hyperparameters and simply optimize them via grid search.
In that case, split off part of your training set as a validation set and define a metric on it. The validation metric has to be chosen to suit your problem (e.g. error rate, F1, Spearman correlation, or others); you can get ideas for metrics by finding Kaggle competitions similar to your problem and looking at the metrics they use.
Select the hyperparameters that optimize your validation metric.
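As a rough sketch of what that grid search could look like (train_and_evaluate, train_set, and val_set are hypothetical placeholders for your own training loop and data splits, and the candidate grid is arbitrary):

import itertools

# Candidate values for each coefficient; adjust the grid to your problem.
grid = [0.1, 0.5, 1.0]

best_metric, best_coeffs = None, None
for alpha, beta, gamma in itertools.product(grid, repeat=3):
    # Hypothetical helper: trains on the training split with these fixed
    # coefficients and returns the validation metric (assumed here to be
    # higher-is-better, e.g. F1).
    metric = train_and_evaluate(alpha, beta, gamma, train_set, val_set)
    if best_metric is None or metric > best_metric:
        best_metric, best_coeffs = metric, (alpha, beta, gamma)

print("best coefficients:", best_coeffs, "validation metric:", best_metric)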