Use of "input_ids,token_type_ids and lm_labels" in BERT Language model

vikasRajashekar · September 19, 2020, 9:36pm

I am trying to figure out what happens when we train or fine-tune a language model like BERT.

Here is the sample code:

from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-german-cased")
model.config.is_decoder = True

lm_loss, prediction_scores, *_ = model(input_ids = input_ids, token_type_ids= token_type_ids, lm_labels = lm_labels)

What happens under the hood? Suppose the tokenized inputs are:
input_ids: [4,5,6,7,8,9,10,11]
token_type_ids: [0,0,0,0,1,1,1,1]
lm_labels: [-1,-1,-1,-1,8,9,10,11]

1. Let's say I pass the above to the model. Does it mean that the loss is calculated only for the last 4 values of input_ids?
2. Does it mean the network learns that the first 4 elements of input_ids are part of one sentence and the other 4 are part of another sentence, and that it learns to predict the series?

I have kept the example as simple as possible so that it is easy to answer in a simple way.

RichardWang · September 20, 2020, 2:05am

Hi @vikasRajashekar,
I assume that by lm_labels you mean labels, and that -1 should be -100 (see docs here).

1. Yes.

2. The model tries to learn to predict the last 4 tokens from the context. The context is all input tokens, including the last four: even if those tokens are masked or replaced, they still contribute correct position information to the context. Either way, all input tokens are used to predict the last four tokens.
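To make the first point concrete, here is a minimal pure-Python sketch of how a cross-entropy loss can skip positions labelled -100. It is meant to mirror the default ignore_index=-100 behavior of PyTorch's CrossEntropyLoss, not to reproduce the library's actual code. The helper masked_cross_entropy, the toy vocabulary size of 12, and the all-zero prediction scores are assumptions made up for illustration.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Return (mean cross-entropy over non-ignored positions, number of scored positions)."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        # Negative log-softmax of the correct class, computed stably.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[label])
    if not losses:
        raise ValueError("every position is ignored, so the loss is undefined")
    return sum(losses) / len(losses), len(losses)


# The labels from the question, with -1 replaced by -100.
lm_labels = [-1, -1, -1, -1, 8, 9, 10, 11]
labels = [IGNORE_INDEX if x == -1 else x for x in lm_labels]

# Hypothetical prediction scores: 8 positions, toy vocabulary of 12 tokens, all zeros.
vocab_size = 12
logits = [[0.0] * vocab_size for _ in labels]

loss, n_scored = masked_cross_entropy(logits, labels)
print(n_scored)        # 4
print(round(loss, 4))  # 2.4849, i.e. log(12) for a uniform prediction
```

Only the last four positions are scored, so changing the prediction scores at the first four positions leaves the loss unchanged. Those first four tokens still matter as input, because the model attends to them when predicting the last four.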
