Does fine-tuning a language model modify its hidden weights?

If I load a pre-trained language model (say BERT) and use a standard PyTorch implementation (as shown in the code block), are the weights of the BERT model updated? And if not, is it recommended to do so for a downstream task?

Note: the task involves using the BERT embeddings for clustering text with a clustering algorithm (like KMeans, DBSCAN, etc.).

import torch.nn as nn
import transformers

# config is assumed to hold MODEL_ID, HIDDEN_SIZE, and NUM_LABELS.
class Model(nn.Module):
  def __init__(self, name):
    super(Model, self).__init__()
    # Pre-trained BERT encoder; return_dict=False makes it return a tuple
    # (sequence_output, pooled_output) instead of a ModelOutput object.
    self.bert = transformers.BertModel.from_pretrained(config['MODEL_ID'], return_dict=False)
    self.bert_drop = nn.Dropout(0.0)
    self.out = nn.Linear(config['HIDDEN_SIZE'], config['NUM_LABELS'])
    self.model_name = name

  def forward(self, ids, mask, token_type_ids):
    # o2 is the pooled [CLS] representation, shape (batch_size, hidden_size).
    _, o2 = self.bert(ids, attention_mask=mask, token_type_ids=token_type_ids)
    bo = self.bert_drop(o2)
    output = self.out(bo)
    return output
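
For context, here is a rough sketch (not from the original post) of how the pooled BERT embeddings from the model above could be fed to KMeans, as mentioned in the note. The tokenizer, texts, and cluster count are placeholders, and config is the same dict assumed in the question's code.

import torch
import transformers
from sklearn.cluster import KMeans

# Placeholder inputs; in practice these would be your own documents.
tokenizer = transformers.BertTokenizer.from_pretrained(config['MODEL_ID'])
texts = ["first document", "second document", "third document"]

model = Model("bert")
model.eval()

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Use the underlying encoder directly to get the pooled [CLS] embedding,
    # rather than the classification head in forward().
    _, pooled = model.bert(enc["input_ids"],
                           attention_mask=enc["attention_mask"],
                           token_type_ids=enc["token_type_ids"])

# Cluster the embeddings; DBSCAN would be used the same way.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(pooled.numpy())
print(labels)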

When you do optimizer = torch.optim.SGD(Model("bert").parameters(), lr=1e-3), the parameters of the pre-trained transformer are ready to be updated along with any layers you add on top. In my experience, updating only the newly introduced layers during fine-tuning led to very slow convergence, so I recommend updating the transformer weights as well.
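
To make the two options concrete, here is a minimal sketch (my own, not from the original answer) of setting up the optimizer for full fine-tuning versus freezing the pre-trained encoder, using the Model class defined in the question:

import torch

model = Model("bert")

# Option 1: full fine-tuning, where every parameter (including BERT's)
# receives gradient updates.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Option 2: freeze the pre-trained encoder and train only the new head.
for p in model.bert.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

# Quick check of how many parameters will actually be updated.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")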

If you are going to use a clustering algorithm that assumes embeddings close to each other in the vector space are semantically related, I recommend fine-tuning with a loss function that enforces such behavior (for example, triplet loss with a Siamese setup, as in SBERT), because the CLS embedding space, so to speak, is not constructed with such concerns in mind, unlike context-independent word embedding spaces.
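
Here is a rough sketch of that idea using the sentence-transformers library, assuming you can build (anchor, positive, negative) text triplets from your data; the checkpoint name and example texts are placeholders only:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder triplets: (anchor, positive, negative).
train_examples = [
    InputExample(texts=["how to reset my password",
                        "password reset instructions",
                        "store opening hours"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Any BERT-like checkpoint can be wrapped; this one is just an example.
model = SentenceTransformer("bert-base-uncased")
train_loss = losses.TripletLoss(model=model)

# Fine-tune so that anchors end up closer to positives than to negatives,
# which is the geometry KMeans/DBSCAN implicitly rely on.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

embeddings = model.encode(["some new text to cluster"])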
