Fine-tuning BERT with multiple classification heads

I need to train a model that uses a shared backbone, such as BERT, as a feature extractor with multiple classification heads. The scenario is similar to multi-task learning, but all of the tasks are classification tasks, so I need multiple classification heads. Does anyone have any similar notebook code that I could start with?

Hi SaraAmd, I’m looking at doing something very similar. I currently have ~15 classification models that all use the same language model. I have a feeling that I will have to write my own forward() function in the end. I’m curious if you made any progress, and if so, I was wondering if you could share some wisdom on the subject. I’m wondering if I need to re-train all of these models at the same time, or if I can extract the classification heads from my existing models and pack them together into a new model with a custom forward() function. If you found any existing literature that could put me on the right path, that’d be awesome too :slight_smile:

Hi, unfortunately I haven’t made any progress, and I haven’t found any literature. But I also assume that the forward function needs to be implemented, as well as a loss function for each classification head (all will use cross entropy, but how exactly I am not sure). We can brainstorm together to move forward. My goal is to have one BERT model as the feature extractor and then add n classification heads on top of it and train them together. Is this your goal as well? I get the feeling you want to train different models separately.
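
To make the brainstorming concrete, here is roughly the kind of skeleton I have in mind (just a sketch, nothing I have tested, and the task names and label counts are placeholders):

from torch import nn
from transformers import AutoModel

class MultiHeadBert(nn.Module):
    """Shared BERT encoder with one classification head per task (sketch)."""

    def __init__(self, model_name="bert-base-uncased", num_labels_per_task=None):
        super().__init__()
        # num_labels_per_task is a placeholder dict, e.g. {"task_a": 3, "task_b": 5}
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_size, n_labels)
            for task, n_labels in num_labels_per_task.items()
        })

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        # The pooled [CLS] representation is the shared feature vector
        pooled = outputs.pooler_output
        # One set of logits per classification head
        return {task: head(pooled) for task, head in self.heads.items()}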

Yes, that is exactly my goal. Currently I am just barely able to fit most of my models onto my GPU at inference time, but the goal is to have a smaller footprint on the GPU so that I can process data in batches instead. I don’t mind writing the forward function; I will likely begin tinkering with it next week. I just need to figure out how to extract the classification heads and pickle them so that I can import them into one large model.
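
My rough (untested) idea for the “extract the heads” part: each BertForSequenceClassification keeps its head in model.classifier, so I’m hoping I can just save those weights and load them into the matching heads of the combined model later. Something like this, with placeholder paths and task names:

import torch
from transformers import BertForSequenceClassification

# Placeholder mapping from task name to an already fine-tuned checkpoint
checkpoints = {"task_a": "path/to/model_a", "task_b": "path/to/model_b"}

head_state_dicts = {}
for task, path in checkpoints.items():
    single_model = BertForSequenceClassification.from_pretrained(path)
    # BertForSequenceClassification keeps its classification head in `classifier`
    head_state_dicts[task] = single_model.classifier.state_dict()

# torch.save uses pickle under the hood, so this covers the "pickle them" part
torch.save(head_state_dicts, "classification_heads.pt")

# Later, in the combined model (assuming its heads sit in an nn.ModuleDict
# keyed by task name), the saved weights could be loaded back with:
# for task, state in torch.load("classification_heads.pt").items():
#     combined_model.heads[task].load_state_dict(state)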

As I mentioned, I have all my models trained separately already :slight_smile: But if someone has pre-written code to do multi-head training all at once, I don’t mind re-training if it means re-using code. I’m happy either way. I’ll share code with you as soon as I start putting pen to paper.

Did you folks manage to find a solution to this in the end? I’ve tried overriding several classes to handle multi-head classification for coarse- and fine-grained labels.

No, I unfortunately did not get around to tackling this. I ended up brute-forcing it with a bigger GPU with more memory. I still want to solve this in the long term, though, but I think the only answer is to write a custom forward() function that creates multiple parallel output linear layers, each of which generates one set of predictions.

I think you are right. In fact, I am able to get the model training by overriding the following:

  • compute_loss
  • forward
  • the model itself, to add the layers
  • the data collator (a stripped-down sketch of that piece is at the end of this post)
  • get_train_dataloader / get_eval_dataloader

I can get it to train, but I cannot get it to compute any metrics; for some reason the HF Trainer skips over compute_metrics at eval time.
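
To give an idea of the data collator piece, here is a stripped-down sketch (not my exact code, and the field names are illustrative) that pads the text inputs and packs both label sets into one labels tensor per batch:

import torch

def multi_label_collator(features, tokenizer):
    """Pad the text inputs and pack coarse/fine labels into one tensor (sketch)."""
    # Assumes each feature dict has "input_ids" and "attention_mask", plus the
    # integer fields "label_coarse" and "label_fine" (illustrative names).
    batch = tokenizer.pad(
        [{k: f[k] for k in ("input_ids", "attention_mask")} for f in features],
        return_tensors="pt",
    )
    coarse = torch.tensor([f["label_coarse"] for f in features], dtype=torch.long)
    fine = torch.tensor([f["label_fine"] for f in features], dtype=torch.long)
    # Shape (batch_size, 2): column 0 = coarse label, column 1 = fine label
    batch["labels"] = torch.stack([coarse, fine], dim=1)
    return batch

Something like data_collator=lambda feats: multi_label_collator(feats, tokenizer) then plugs it into the Trainer.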

Have you considered Google? It’s a pretty good search engine :wink:

I do appreciate the article, and it may be useful for the other posters, but it doesn’t give me any clues on getting the compute_metrics function to work during the evaluation step. So far I have narrowed it down to the all_labels variable being set to None before compute_metrics is called in the eval loop.
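
In case it helps anyone else hitting the same wall, the two TrainingArguments settings I’m currently poking at are remove_unused_columns and label_names, since (as far as I understand it) the Trainer uses label_names to decide which inputs count as labels in the eval loop. I’m not certain either is the actual cause:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    # Keep custom label columns from being dropped before the collator sees them
    remove_unused_columns=False,
    # Tell the Trainer which input key(s) hold the labels during evaluation
    label_names=["labels"],
)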

On the whole, as a senior community member, I think it’s worth reflecting on how a response like this creates a culture in which newer community members lose the confidence to even ask questions.

Thank you for sharing the article.


Hi,

Sorry for the earlier reply! One should be able to ask any question.

One can add a compute_metrics function to the blog post above. The compute_metrics function takes in a named tuple (an EvalPrediction), which one can split into logits and labels, e.g.:

import numpy as np
import evaluate

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Hence, in the case of a multi-task model as shown in the blog above, the logits depend on which task head is being used. To handle this, one could add an attribute called “task_name” to the model’s init, change it depending on which task you want to evaluate, and then call trainer.evaluate() for each task.
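
A minimal sketch of that idea (the names are illustrative, and it assumes the heads live in a dict keyed by task name):

from torch import nn

class MultiTaskModel(nn.Module):
    """Sketch: select the active head via a task_name attribute set from outside."""

    def __init__(self, encoder, heads, task_name):
        super().__init__()
        self.encoder = encoder      # shared backbone, e.g. a BertModel
        self.heads = heads          # nn.ModuleDict with one head per task
        self.task_name = task_name  # which head to use right now

    def forward(self, input_ids, attention_mask=None):
        pooled = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        return self.heads[self.task_name](pooled)

# Evaluate one task at a time by switching the attribute before each call:
# model.task_name = "task_a"
# trainer.evaluate(eval_dataset=task_a_eval_set)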


Thanks again, this is helpful for me. Am I right in thinking that the task_name variable effectively chooses the right head depending on the task? If so, what about a single task that requires two heads? In my case I need to train the model to give both a coarse and a fine-grained output.

For some context, here is my loss function:

def compute_loss(self, model, inputs, return_outputs=False):
    # My custom data collator packs both label sets into inputs["labels"];
    # the indexing below assumes a nested [[coarse_labels, fine_labels]] layout
    # (with a flat (batch_size, 2) tensor it would be labels[:, 0] and labels[:, 1]).
    labels = inputs.pop('labels')
    labels_coarse = labels[0][0]
    labels_fine = labels[0][1]

    # The model returns one set of logits per head
    outputs = model(**inputs)
    logits_coarse, logits_fine = outputs

    # Cross-entropy per head (num_labels_coarse / num_labels_fine are defined
    # elsewhere), averaged into a single training loss
    loss_fct = torch.nn.CrossEntropyLoss()
    loss_coarse = loss_fct(logits_coarse.view(-1, num_labels_coarse), labels_coarse.view(-1))
    loss_fine = loss_fct(logits_fine.view(-1, num_labels_fine), labels_fine.view(-1))

    loss = (loss_coarse + loss_fine) / 2

    return (loss, outputs) if return_outputs else loss
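
Building on the compute_metrics example above, my current attempt for the two-head case looks something like the following. It assumes the eval predictions arrive as a tuple of the two logits arrays and the labels as an array of shape (num_examples, 2) with the coarse labels in the first column, neither of which I have fully verified yet:

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    # Assumed layout: predictions is a tuple of the two logits arrays and
    # labels is an array of shape (num_examples, 2) with coarse in column 0.
    (logits_coarse, logits_fine), labels = eval_preds
    preds_coarse = np.argmax(logits_coarse, axis=-1)
    preds_fine = np.argmax(logits_fine, axis=-1)
    return {
        "accuracy_coarse": accuracy.compute(
            predictions=preds_coarse, references=labels[:, 0])["accuracy"],
        "accuracy_fine": accuracy.compute(
            predictions=preds_fine, references=labels[:, 1])["accuracy"],
    }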