I’m currently doing research on transfer learning, and I’m using bert-base-cased as a pretrained baseline, wrapped in a PyTorch model with dropout and a linear layer. I largely follow the BERT recommendations and train with the AdamW optimizer and a learning-rate scheduler.
I can train on one task without any problem, using the BERT-recommended parameters of LR=2e-5, batch size=32, epochs=2. I use a cross-entropy loss with class weighting to address a label imbalance.
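For concreteness, the optimizer and scheduler setup can be sketched roughly as follows. This is a minimal sketch under assumptions: the linear decay schedule, the `build_optimizer_and_scheduler` helper, and the tiny stand-in model are illustrative and not necessarily what my actual code does. Only LR, batch size, and epochs come from the settings above.

```python
import torch

LR = 2e-5        # from the settings above
BATCH_SIZE = 32  # from the settings above
EPOCHS = 2       # from the settings above


def build_optimizer_and_scheduler(model, steps_per_epoch, warmup_steps=0):
    """Hypothetical helper: AdamW plus a linear warmup/decay schedule.

    The scheduler is meant to be stepped once per batch, as in the
    training loop, so the total step count is batches * epochs.
    """
    total_steps = steps_per_epoch * EPOCHS
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

    def lr_lambda(step):
        # Linear warmup (if any), then linear decay to zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler


# Stand-in for the BERT wrapper (assumption), just to exercise the setup.
model = torch.nn.Linear(8, 2)
optimizer, scheduler = build_optimizer_and_scheduler(model, steps_per_epoch=100)
initial_lr = scheduler.get_last_lr()[0]
```

Because the schedule is tied to a total step count, the optimizer and scheduler would need to be rebuilt for each new task rather than carried over from the previous run.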
The problem comes when I save this model and then reload its state as a base for further downstream training on a different classification task: training never finishes a single epoch. What’s also strange is that GPU utilisation sits at a constant 100%, which I don’t see with the baseline models.
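The save and reload step looks roughly like the sketch below. Everything named here is an assumption for illustration: the small `Classifier` stands in for the BERT wrapper, the checkpoint path is hypothetical, and re-initialising the head for the new task is one possible choice rather than a confirmed detail of my setup.

```python
import os
import tempfile

import torch
from torch import nn


class Classifier(nn.Module):
    """Stand-in for the BERT wrapper: encoder -> dropout -> linear head."""

    def __init__(self, hidden_size=16, n_classes=2):
        super().__init__()
        self.encoder = nn.Linear(32, hidden_size)  # placeholder for bert-base-cased
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        return self.head(self.dropout(torch.relu(self.encoder(x))))


# After training on the first task: save only the weights (state_dict).
source_model = Classifier()
checkpoint_path = os.path.join(tempfile.mkdtemp(), "task_a.pt")  # hypothetical path
torch.save(source_model.state_dict(), checkpoint_path)

# For the downstream task: build a fresh model and load the saved weights.
target_model = Classifier()
state = torch.load(checkpoint_path, map_location="cpu")
target_model.load_state_dict(state)

# Assumption: give the new task its own freshly initialised head.
target_model.head = nn.Linear(16, 2)

# Create the optimizer only after the head is replaced, so that it
# tracks the new head's parameters as well as the encoder's.
optimizer = torch.optim.AdamW(target_model.parameters(), lr=2e-5)
```

Loading with `map_location="cpu"` and then moving the model to the GPU keeps the reload independent of whichever device the checkpoint was saved from.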
    def train(self, optimizer, scheduler, minibatches: torch.utils.data.DataLoader) -> Dict:
        self.model = self.model.train()
        metrics = {"n_correct": 0, "losses": []}
        for batch in minibatches:
            texts, targets = batch
            targets = targets.to(self.device)
            encoded_input = BERTPreprocessor.encode(texts, self.tokenizer)
            for tensor in encoded_input:
                encoded_input[tensor] = (encoded_input[tensor]
                                         .to(self.device))
            logits, *_ = self.model(**{
                "input_ids": encoded_input["input_ids"],
                "attention_mask": encoded_input["attention_mask"],
            })
            loss = self.loss_fn(logits, targets)
            _, preds = torch.max(logits, dim=1)
            metrics["n_correct"] += torch.sum(preds == targets)
            metrics["losses"].append(loss.item())
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        return metrics
I’ve attached my training loop above and I’m happy to provide any further information, but as I’m fairly new to the field, I’m at a loss as to how to address this.
To add, the weighting scheme I use for the weight parameter of the loss function is num_minority_class/class_n for the negative and positive classes (binary classification).
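As a worked example of that weighting, here is a minimal sketch. The label counts are made up and `minority_ratio_weights` is a hypothetical helper name. With 900 negatives and 100 positives, the weights come out as 100/900 for the negative class and 100/100 for the positive class.

```python
import torch


def minority_ratio_weights(targets, n_classes=2):
    """Weight for each class = (size of the smallest class) / (size of that class)."""
    counts = torch.bincount(targets, minlength=n_classes).float()
    return counts.min() / counts


# Made-up label distribution: 900 negatives (class 0), 100 positives (class 1).
targets = torch.tensor([0] * 900 + [1] * 100)
weights = minority_ratio_weights(targets)

# The weight tensor has to live on the same device as the logits.
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

# Tiny smoke check with random logits.
logits = torch.randn(4, 2)
loss = loss_fn(logits, torch.tensor([0, 1, 0, 1]))
```

The minority class always gets weight 1 and the majority class a weight below 1. A class with zero examples would produce a division by zero, so the weights should be computed separately from each task's own training labels.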