I have been pretraining ELECTRA models using the `ElectraForPreTraining` class. Some of them take `input_ids`, while others take embeddings directly via the `inputs_embeds` parameter (I am training the model on biological data and assign particular tokens to particular species).
As I understand from the paper, both the generator and the discriminator are trained during pretraining: the generator is a small masked language model, and the discriminator predicts whether each token is an original or a replacement produced by the generator. The paper also states that "if the generator happens to generate the correct token, that token is considered 'real' instead of 'fake'."

However, the labels `ElectraForPreTraining` uses as ground truth for "real" vs. "fake" are an array that I pass in myself, and I have no way of knowing in advance whether the generator will predict a given token correctly. As a result, the labels I pass in mark every masked token as "fake", regardless of whether the generator actually reproduced the original. Since the model also doesn't receive any information about the true identity of the masked tokens, it seems to me it has no way to handle this case. Is this handled properly inside the model? If so, what do I need to do to make sure I am passing in my data correctly?
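For context, here is how I currently understand the labeling rule from the paper: a masked position should be labeled "fake" only when the generator's sampled token differs from the original, so the labels can only be built *after* sampling from the generator. This is a minimal sketch of that rule; the function name, argument shapes, and label convention (1 = fake/replaced, 0 = real) are my own illustration, not part of the `transformers` API.

```python
def discriminator_labels(original_ids, sampled_ids, masked_positions):
    """Build ELECTRA discriminator labels from generator samples.

    original_ids     -- list of the true token ids for the sequence
    sampled_ids      -- list of ids after the generator replaced masked positions
    masked_positions -- set of indices that were masked out for the generator

    Returns one label per token: 1 = "fake" (replaced), 0 = "real".
    A masked position where the generator sampled the correct token
    stays "real", per the rule quoted from the paper above.
    """
    labels = []
    for i, (orig, samp) in enumerate(zip(original_ids, sampled_ids)):
        is_fake = (i in masked_positions) and (samp != orig)
        labels.append(1 if is_fake else 0)
    return labels


# Example: positions 1 and 2 were masked; the generator got position 1
# right (17 -> 17) but replaced position 2 (23 -> 41), so only position 2
# is labeled "fake".
labels = discriminator_labels(
    original_ids=[5, 17, 23, 8],
    sampled_ids=[5, 17, 41, 8],
    masked_positions={1, 2},
)
print(labels)  # [0, 0, 1, 0]
```

My question boils down to whether something equivalent to this comparison happens anywhere inside `ElectraForPreTraining`, or whether I am expected to run the generator myself and compute the labels this way before calling the discriminator.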