i’m trying to train a BERT model on the task of recovering masked segments from a sequence.
unlike the usual procedure, my input is already a sequence of vector embeddings representing some sequential data. i want to train a model such that, given such a sequence with 15% of its content masked, the model learns to recover the original vectors.
what is not clear to me in the BertForMaskedLM setup is where the masking takes place and where the loss gets computed.
in my ‘pytorchic’ view, i would program a Dataset whose __getitem__ clones the input, masks part of the clone, and returns both as (x, y), then have the model take x and output something as close as possible to y.
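to make that concrete, here’s a minimal sketch of the Dataset i have in mind (the class name, the zero-vector masking, and the (seq_len, dim) tensor shape are all my own assumptions, not anything from the BERT codebase):

```python
import torch
from torch.utils.data import Dataset

class EmbeddingMaskingDataset(Dataset):
    """Hypothetical dataset: clones each embedding sequence, zeroes out
    ~15% of its positions, and returns (masked, original, mask)."""

    def __init__(self, sequences, mask_prob=0.15):
        self.sequences = sequences   # list of (seq_len, dim) float tensors
        self.mask_prob = mask_prob

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        y = self.sequences[idx]                        # original embeddings
        x = y.clone()
        mask = torch.rand(y.size(0)) < self.mask_prob  # pick ~15% of positions
        x[mask] = 0.0                                  # zero out the masked vectors
        return x, y, mask                              # keep the mask for the loss
```

returning the mask alongside (x, y) lets the loss be computed only on the masked positions, mirroring how masked language modeling ignores unmasked tokens.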
i understand that with BERT this happens somewhere inside the pipeline, but i can’t figure out where.
so far, i used BertForMaskedLM and fed it the ‘inputs_embeds’ argument, and the output was a probability distribution over the vocabulary (which in my case is meaningless, because my input is already embeddings and the output should stay in that space).
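what i suspect i actually need is the bare BertModel encoder plus a regression head instead of the LM head. a sketch of that idea, assuming an MSE loss over masked positions only (the wrapper class, the projection layer, and the loss choice are my assumptions):

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class BertEmbeddingRecovery(nn.Module):
    """Hypothetical wrapper: BERT encoder (no LM head) plus a linear
    head mapping hidden states back to the input embedding dimension."""

    def __init__(self, config, embed_dim):
        super().__init__()
        self.proj = nn.Linear(embed_dim, config.hidden_size)  # into BERT's hidden size
        self.encoder = BertModel(config)  # randomly initialized here; from_pretrained in practice
        self.head = nn.Linear(config.hidden_size, embed_dim)  # back to embedding space
        self.loss_fn = nn.MSELoss()

    def forward(self, inputs_embeds, labels, mask):
        hidden = self.encoder(inputs_embeds=self.proj(inputs_embeds)).last_hidden_state
        pred = self.head(hidden)                       # (batch, seq_len, embed_dim)
        loss = self.loss_fn(pred[mask], labels[mask])  # regression loss on masked positions only
        return loss, pred
```

this sidesteps the vocabulary distribution entirely: the model regresses the original vectors rather than classifying token ids.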
i couldn’t track whether any masking occurs, though i’m pretty certain the ‘data_collator’ wasn’t called, and that’s where i’d expect the masking to take place.
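for context, this is roughly what i understand DataCollatorForLanguageModeling does at batch-collation time (a simplified sketch, ignoring special-token handling; the function name is mine): masking happens here, on token ids, before anything reaches the model — which is why it never touches my pre-computed embeddings.

```python
import torch

def mask_tokens_sketch(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Rough sketch of MLM masking as done in the collator, not the model.
    Simplified: no special-token masking, no padding handling."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()  # ~15% of positions
    labels[~masked] = -100  # loss is computed only where labels != -100
    # of the chosen positions: 80% -> [MASK], 10% -> random token, 10% -> unchanged
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id
    random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    input_ids[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]
    return input_ids, labels
```

so in the standard pipeline the Trainer calls the collator on each batch, and the model only ever sees already-masked ids plus the -100-padded labels.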
hope it’s not too big of a question!!