Use of "input_ids", "token_type_ids" and "lm_labels" in a BERT language model

I am trying to figure out what happens when we train or fine-tune a language model like BERT.

Here is the sample code:

from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-german-cased")
model.config.is_decoder = True

lm_loss, prediction_scores, *_ = model(input_ids=input_ids, token_type_ids=token_type_ids, lm_labels=lm_labels)

What happens under the hood? Here is the tokenized input:
input_ids example: [4, 5, 6, 7, 8, 9, 10, 11]

1. Let's say I pass the above to the model. Does that mean the loss is calculated only for the last 4 values of input_ids?
2. Does that mean the network learns that the first 4 elements of input_ids belong to one sentence and the other 4 belong to another sentence, and that it learns to predict the sequence?

I have kept the example as simple as possible so that it is easy to answer in a simple way.

Hi @vikasRajashekar,
I assume that by lm_labels you mean labels, and that -1 should be -100 (see docs here).


The model learns to predict the last 4 tokens from the context. The context is all of the input tokens, including the last four: even if those four are masked or replaced, they still contribute correct position information. Either way, all input tokens are used to predict the last four tokens.
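
To make the role of -100 concrete, here is a small pure-Python sketch of a cross-entropy loss that skips positions labelled -100. It is only an illustration, not the library's implementation. The labels list, the vocabulary size of 12, and the uniform logits are all made up for this example, since the original post does not show them; the function name masked_cross_entropy is hypothetical too.

```python
import math

IGNORE_INDEX = -100  # label value that is skipped when computing the loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: one list of scores per position (one score per vocabulary id).
    labels: target ids, with ignore_index at positions that should be skipped.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        # Negative log-softmax of the target score, computed stably.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


# Assumed toy data: 8 positions, hypothetical vocabulary of 12 ids.
input_ids = [4, 5, 6, 7, 8, 9, 10, 11]
labels = [-100, -100, -100, -100, 8, 9, 10, 11]  # assumption: only the last 4 are scored
vocab_size = 12
logits = [[0.0] * vocab_size for _ in input_ids]  # made-up uniform scores

loss = masked_cross_entropy(logits, labels)
scored_positions = [i for i, label in enumerate(labels) if label != IGNORE_INDEX]

# Changing the scores at an ignored position leaves the loss unchanged.
noisy_logits = [row[:] for row in logits]
noisy_logits[0][3] = 5.0
loss_after = masked_cross_entropy(noisy_logits, labels)
```

In this sketch only positions 4 to 7 enter the loss, which matches the idea that the model is trained to predict the last four tokens. The first four tokens still matter in the real model, because they are part of the input the network sees when producing the scores for the later positions.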