Fine-tuning BERT with multiple classification heads

I need to train a model that uses a shared backbone, such as BERT, as a feature extractor with multiple classification heads. The scenario is similar to multi-task learning, but all of the tasks are classification tasks, so I need multiple classification heads. Does anyone have any similar notebook code that I could start with?

Hi SaraAmd, I’m looking at doing something very similar. I currently have ~15 classification models that all use the same language model. I have a feeling that I will have to write my own forward() function in the end. I’m curious if you made any progress, and if so, I was wondering if you could share some wisdom on the subject. I’m wondering if I need to re-train all of these models at the same time, or if I can extract the classification heads from my existing models and pack them together into a new model with a custom forward() function. If you found any existing literature that could put me on the right path, that’d be awesome too :slight_smile:

Hi, unfortunately I haven’t made any progress, and I haven’t found any literature. But I also assume that the forward function needs to be implemented, as well as a loss function for each classification head (all will use cross entropy, but how exactly I am not sure). We can brainstorm together to move forward. My goal is to have one BERT model as the feature extractor and then add n classification heads on top of it and train them together. Is this your goal as well? I get the feeling you want to train different models separately.
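
To make the brainstorming concrete, here is roughly the kind of skeleton I have in mind (just a sketch, nothing I have tested, and the task names and label counts are placeholders):

from torch import nn
from transformers import AutoModel

class MultiHeadBert(nn.Module):
    """Shared BERT encoder with one classification head per task (sketch)."""

    def __init__(self, model_name="bert-base-uncased", num_labels_per_task=None):
        super().__init__()
        # num_labels_per_task is a placeholder dict, e.g. {"task_a": 3, "task_b": 5}
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_size, n_labels)
            for task, n_labels in num_labels_per_task.items()
        })

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        # The pooled [CLS] representation is the shared feature vector
        pooled = outputs.pooler_output
        # One set of logits per classification head
        return {task: head(pooled) for task, head in self.heads.items()}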

Yes, that is exactly my goal. Currently I am just barely able to fit most of my models onto my GPU at inference time, but the goal is to have a smaller footprint on the GPU so that I can process data in batches instead. I don’t mind writing the forward function; I will likely begin tinkering with it next week. I just need to figure out how to extract the classification heads and pickle them so that I can import them into one large model.
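
My rough (untested) idea for the “extract the heads” part: each BertForSequenceClassification keeps its head in model.classifier, so I’m hoping I can just save those weights and load them into the matching heads of the combined model later. Something like this, with placeholder paths and task names:

import torch
from transformers import BertForSequenceClassification

# Placeholder mapping from task name to an already fine-tuned checkpoint
checkpoints = {"task_a": "path/to/model_a", "task_b": "path/to/model_b"}

head_state_dicts = {}
for task, path in checkpoints.items():
    single_model = BertForSequenceClassification.from_pretrained(path)
    # BertForSequenceClassification keeps its classification head in `classifier`
    head_state_dicts[task] = single_model.classifier.state_dict()

# torch.save uses pickle under the hood, so this covers the "pickle them" part
torch.save(head_state_dicts, "classification_heads.pt")

# Later, in the combined model (assuming its heads sit in an nn.ModuleDict
# keyed by task name), the saved weights could be loaded back with:
# for task, state in torch.load("classification_heads.pt").items():
#     combined_model.heads[task].load_state_dict(state)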

As I mentioned, I have all my models trained separately already :slight_smile: But if someone has pre-written code to do multi-head training all at once, I don’t mind re-training if it means re-using code. I’m happy either way. I’ll share code with you as soon as I start putting pen to paper.

Did you folks manage to find a solution to this in the end? I’ve tried overriding several classes to handle multi-head classification for coarse- and fine-grained labels.

No, I unfortunately did not get around to tackling this. I ended up brute-forcing it with a bigger GPU with more memory. I still want to solve this in the long term, though, but I think the only answer is to write a custom forward() function that creates multiple parallel output linear layers, each of which generates one set of predictions.

I think you are right. In fact, I am able to get the model training by overriding the following:

  • compute_loss
  • forward
  • the model itself, to add the layers
  • the data collator (a stripped-down sketch of that piece is at the end of this post)
  • get_train_dataloader / get_eval_dataloader

I can get it to train, but I cannot get it to compute any metrics; for some reason the HF Trainer skips over compute_metrics at eval time.
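
To give an idea of the data collator piece, here is a stripped-down sketch (not my exact code, and the field names are illustrative) that pads the text inputs and packs both label sets into one labels tensor per batch:

import torch

def multi_label_collator(features, tokenizer):
    """Pad the text inputs and pack coarse/fine labels into one tensor (sketch)."""
    # Assumes each feature dict has "input_ids" and "attention_mask", plus the
    # integer fields "label_coarse" and "label_fine" (illustrative names).
    batch = tokenizer.pad(
        [{k: f[k] for k in ("input_ids", "attention_mask")} for f in features],
        return_tensors="pt",
    )
    coarse = torch.tensor([f["label_coarse"] for f in features], dtype=torch.long)
    fine = torch.tensor([f["label_fine"] for f in features], dtype=torch.long)
    # Shape (batch_size, 2): column 0 = coarse label, column 1 = fine label
    batch["labels"] = torch.stack([coarse, fine], dim=1)
    return batch

Something like data_collator=lambda feats: multi_label_collator(feats, tokenizer) then plugs it into the Trainer.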

Have you considered Google? It’s a pretty good search engine :wink:

I do appreciate the article, and it may be useful for the other posters, but it doesn’t give me any clues on getting the compute_metrics function to work during the evaluation step. So far I have narrowed it down to the all_labels variable being set to None before compute_metrics is called in the eval loop.
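
In case it helps anyone else hitting the same wall, the two TrainingArguments settings I’m currently poking at are remove_unused_columns and label_names, since (as far as I understand it) the Trainer uses label_names to decide which inputs count as labels in the eval loop. I’m not certain either is the actual cause:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    # Keep custom label columns from being dropped before the collator sees them
    remove_unused_columns=False,
    # Tell the Trainer which input key(s) hold the labels during evaluation
    label_names=["labels"],
)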

On the whole, as a senior community member, I think it’s worth reflecting on how a response like this creates a culture in which newer community members lose the confidence to even ask questions.

Thank you for sharing the article.


Hi,

Sorry for the earlier reply! One should be able to ask any question.

One can add a compute_metrics function to the blog post above. The compute_metrics function takes in a named tuple (an EvalPrediction), which one can split into logits and labels, e.g.:

import numpy as np
import evaluate

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Hence, in the case of a multi-task model as shown in the blog above, the logits depend on which task head is being used. To handle this, one could add an attribute called “task_name” to the model’s init, change it depending on which task you want to evaluate, and then call trainer.evaluate() for each task.
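
A minimal sketch of that idea (the names are illustrative, and it assumes the heads live in a dict keyed by task name):

from torch import nn

class MultiTaskModel(nn.Module):
    """Sketch: select the active head via a task_name attribute set from outside."""

    def __init__(self, encoder, heads, task_name):
        super().__init__()
        self.encoder = encoder      # shared backbone, e.g. a BertModel
        self.heads = heads          # nn.ModuleDict with one head per task
        self.task_name = task_name  # which head to use right now

    def forward(self, input_ids, attention_mask=None):
        pooled = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        return self.heads[self.task_name](pooled)

# Evaluate one task at a time by switching the attribute before each call:
# model.task_name = "task_a"
# trainer.evaluate(eval_dataset=task_a_eval_set)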


Thanks again, this is helpful for me. Am I right in thinking that the task_name variable effectively chooses the right head depending on the task? If so, what about a single task that requires two heads? In my case I need to train the model to give both a coarse and a fine-grained output.

For some context, here is my loss function:

def compute_loss(self, model, inputs, return_outputs=False):
    # My custom data collator packs both label sets into inputs["labels"];
    # the indexing below assumes a nested [[coarse_labels, fine_labels]] layout
    # (with a flat (batch_size, 2) tensor it would be labels[:, 0] and labels[:, 1]).
    labels = inputs.pop('labels')
    labels_coarse = labels[0][0]
    labels_fine = labels[0][1]

    # The model returns one set of logits per head
    outputs = model(**inputs)
    logits_coarse, logits_fine = outputs

    # Cross-entropy per head (num_labels_coarse / num_labels_fine are defined
    # elsewhere), averaged into a single training loss
    loss_fct = torch.nn.CrossEntropyLoss()
    loss_coarse = loss_fct(logits_coarse.view(-1, num_labels_coarse), labels_coarse.view(-1))
    loss_fine = loss_fct(logits_fine.view(-1, num_labels_fine), labels_fine.view(-1))

    loss = (loss_coarse + loss_fine) / 2

    return (loss, outputs) if return_outputs else loss
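
Building on the compute_metrics example above, my current attempt for the two-head case looks something like the following. It assumes the eval predictions arrive as a tuple of the two logits arrays and the labels as an array of shape (num_examples, 2) with the coarse labels in the first column, neither of which I have fully verified yet:

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    # Assumed layout: predictions is a tuple of the two logits arrays and
    # labels is an array of shape (num_examples, 2) with coarse in column 0.
    (logits_coarse, logits_fine), labels = eval_preds
    preds_coarse = np.argmax(logits_coarse, axis=-1)
    preds_fine = np.argmax(logits_fine, axis=-1)
    return {
        "accuracy_coarse": accuracy.compute(
            predictions=preds_coarse, references=labels[:, 0])["accuracy"],
        "accuracy_fine": accuracy.compute(
            predictions=preds_fine, references=labels[:, 1])["accuracy"],
    }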