I am trying to figure out what happens when we train/fine-tune a language model like BERT.
Here is the sample code:
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-german-cased")
model.config.is_decoder = True  # switch to causal (left-to-right) attention
lm_loss, prediction_scores, *_ = model(input_ids=input_ids, token_type_ids=token_type_ids, lm_labels=lm_labels)
What happens under the hood? Here are the tokenized inputs:
input_ids: [4, 5, 6, 7, 8, 9, 10, 11]
token_type_ids: [0, 0, 0, 0, 1, 1, 1, 1]
lm_labels: [-1, -1, -1, -1, 8, 9, 10, 11]
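For concreteness, this is roughly how I build these inputs before the call above; the batch size of 1 and the plain torch.tensor construction are just my assumptions for a minimal example:

import torch

# A single example (batch size 1); the values match the lists above.
input_ids = torch.tensor([[4, 5, 6, 7, 8, 9, 10, 11]])
token_type_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
lm_labels = torch.tensor([[-1, -1, -1, -1, 8, 9, 10, 11]])  # -1 on positions I expect to be ignored by the loss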
1. Let's say I pass the above to the model. Does it mean that the loss is calculated only for the last 4 values of input_ids? (My current understanding is sketched after these questions.)
2. Does it mean the network learns that the first 4 elements of input_ids belong to one sentence and the other 4 to another sentence, and that it learns to predict the sequence?
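Regarding question 1, here is a minimal sketch of what I assume the model does internally with lm_labels: shift the logits by one position and compute a cross-entropy that skips label -1. The shifting detail and the ignore_index value are my assumptions based on the old lm_labels API, not something I have verified in the library source.

import torch.nn as nn

# prediction_scores: (batch, seq_len, vocab_size) logits from the model call above.
# I assume position i is trained to predict token i+1 (causal LM), so both
# tensors are shifted by one, and every position labelled -1 is ignored.
shifted_logits = prediction_scores[:, :-1, :]
shifted_labels = lm_labels[:, 1:]
loss_fct = nn.CrossEntropyLoss(ignore_index=-1)  # -1 labels contribute nothing to the loss
assumed_lm_loss = loss_fct(shifted_logits.reshape(-1, shifted_logits.size(-1)), shifted_labels.reshape(-1))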
I have kept the example as simple as possible so that it is easy to answer in a simple way.