I recently read the WARP paper (https://aclanthology.org/2021.acl-long.381.pdf), which introduces a soft verbalizer (though I believe the term itself was coined by other papers).
In my understanding, they remove the decoder in the LM head of the masked language model and replace it with a linear layer whose output dimension is the number of verbalizers. This seems conceptually similar to replacing the LM head when fine-tuning an LM for sentence classification.
I want to try this scheme and train only the soft verbalizer. However, I cannot find a tutorial for anything similar. Can I do this in Hugging Face by changing only the LM head, like:
```python
model.lm_head.decoder = torch.nn.Linear(768, 2, bias=True).to("cuda")
```
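For context, here is the fuller setup I have in mind. This is just a rough sketch under my own assumptions: roberta-base as the backbone (so `lm_head.decoder` follows RoBERTa's naming; other models name these modules differently), two classes, and a made-up example sentence:

```python
import torch
from transformers import AutoTokenizer, RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Replace the vocab-sized decoder with a small "soft verbalizer" layer
# (hidden_size -> number of classes, here 2).
model.lm_head.decoder = torch.nn.Linear(model.config.hidden_size, 2, bias=True)

# Freeze everything, then unfreeze only the new layer, so that only
# the soft verbalizer is trained.
for param in model.parameters():
    param.requires_grad = False
for param in model.lm_head.decoder.parameters():
    param.requires_grad = True

# Sanity check: the logits should now be 2-way per token, and I would
# read them off at the <mask> position.
inputs = tokenizer("The movie was <mask>.", return_tensors="pt")
logits = model(**inputs).logits  # shape: [1, seq_len, 2]
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
verbalizer_logits = logits[0, mask_idx]  # shape: [1, 2]
```

One thing I am unsure about is whether the original decoder's weight tying to the input embeddings (and the head's separate bias parameter) interferes with this replacement, which is part of my question.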
Are there any caveats to changing the decoder like this for the training process?