I’m trying to fine-tune a BERT model for token classification, and I’m wondering exactly how the fine-tuning is done. When I pre-trained the model, I used masked language modelling (MLM) and masked out 15% of the tokens, as suggested in the original paper. Now that I’m fine-tuning the model, do I still mask out 15% of the input/target data, or do I drop the masking and just train on the unmasked data?
Hi @tueboesen, with BERT, masked language modelling (MLM) is used only as a pre-training task. For fine-tuning, MLM is not used: you feed the unmasked input to the model and train it on your token-level labels.
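To make the difference concrete, here is a rough, library-free sketch of how one training example is built in each phase. The function names `make_mlm_example` and `make_token_cls_example`, the toy sentence, and the tag set are made up for illustration. The masking is simplified (the original paper also replaces some selected tokens with random tokens or leaves them unchanged), and using `-100` as the "ignore this position" label is an assumption that follows the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random

MASK = "[MASK]"
IGNORE = -100  # assumed "skip this position" label (PyTorch cross-entropy default)


def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Pre-training (simplified): hide some tokens; targets are the hidden tokens."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)     # the model must reconstruct this token
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(IGNORE)  # no loss on tokens that were not masked
    return inputs, targets


def make_token_cls_example(tokens, tags, tag_to_id):
    """Fine-tuning: inputs stay unmasked; every real token gets a class label."""
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per token")
    inputs = ["[CLS]"] + list(tokens) + ["[SEP]"]
    # Special tokens carry no tag, so they are excluded from the loss.
    labels = [IGNORE] + [tag_to_id[t] for t in tags] + [IGNORE]
    return inputs, labels


# Hypothetical toy data for illustration only.
tokens = ["john", "lives", "in", "paris"]
tags = ["B-PER", "O", "O", "B-LOC"]
tag_to_id = {"O": 0, "B-PER": 1, "B-LOC": 2}

mlm_inputs, mlm_targets = make_mlm_example(tokens)
cls_inputs, cls_labels = make_token_cls_example(tokens, tags, tag_to_id)
print(cls_inputs)
print(cls_labels)
```

In the fine-tuning example nothing is replaced by `[MASK]`: the supervision comes entirely from the tag of each token rather than from reconstructing hidden tokens.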