ELECTRA: Accounting for mask tokens that are correctly predicted by MLM

I have been pretraining ELECTRA models using the ElectraForPreTraining class, some of them taking input_ids and others taking embeddings directly via the inputs_embeds parameter (I am training the model on biological data and assign particular tokens to particular species).

As I understand from the paper, both the generator and the discriminator are trained during pretraining: the generator is a small masked language model, and the discriminator predicts whether each token is the original or a token replaced by the generator. The paper also states that "if the generator happens to generate the correct token, that token is considered 'real' instead of 'fake'." However, the labels ElectraForPreTraining uses as the ground truth for "real" vs. "fake" are an array passed in by me, and I have no way of knowing in advance whether the generator will predict a given token correctly, so the labels I pass in mark every masked token as "fake" whether or not the generator actually succeeds. Moreover, since the model receives no information about the true identity of the masked tokens, it seems to me that the model has no way to handle this case. Is this being properly handled in the model? If so, what do I have to do to make sure I am passing in my data correctly?

inputs
| mask
masked inputs
| generator (ElectraForMaskedLM)
gen logits
| gumbel softmax
generated
| discriminator (ElectraForPreTraining)
disc logits

is_replaced = generated == inputs
edited: is_replaced = generated != inputs

is_replaced is the ground-truth label for the discriminator.
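A tiny sketch of that label computation (PyTorch assumed; the token ids are toy values): because is_replaced compares generated against the original inputs, a masked position the generator happens to predict correctly ends up labeled real.

```python
import torch

inputs = torch.tensor([5, 17, 9, 42])                # original token ids
mask_pos = torch.tensor([False, True, False, True])  # positions that were masked
generated = inputs.clone()
generated[mask_pos] = torch.tensor([17, 8])          # correct at pos 1, wrong at pos 3

is_replaced = (generated != inputs).long()           # 1 = fake, 0 = real
print(is_replaced.tolist())  # [0, 0, 0, 1]
```

Note that position 1 was masked, yet its label is 0 (real) because the generator reproduced the original token.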

@RichardWang

Thanks for responding. I understand that is_replaced = generated != inputs would give the true labels, but my issue is that when an input is a masked token, generated will never equal the masked input; however, if the generated token matches the value that position had before masking, then that token should not be considered replaced, which seems not to be the case under the setup you described. I am confused as to whether this is dealt with.

@rsvarma
I am not sure I got your point. To avoid ambiguity (e.g., which "input"? the input of the generator or of the discriminator?), you could use exactly the terms I typed:

inputs
| mask
masked inputs
| generator (ElectraForMaskedLM)
gen logits
| gumbel softmax
generated
| discriminator (ElectraForPreTraining)
disc logits

But let me list a few points so that we can confirm whether our understandings are the same.

  • You yourself initialize a model instance of ElectraForMaskedLM as the generator and an instance of ElectraForPreTraining as the discriminator.
  • The input of the generator is masked_inputs, which you create by masking inputs.
  • The input of the discriminator is generated, which you obtain by feeding masked_inputs to the generator and then sampling from the generator's output.
  • generated can only differ from inputs at masked positions.
  • At some masked positions, generated may still equal the original inputs; it depends on whether the generator predicts correctly at those positions.
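The pipeline in the points above can be sketched as a single training step. This is a rough sketch, assuming the HuggingFace `transformers` library: the tiny config values are illustrative, plain categorical sampling stands in for the Gumbel-softmax step in the diagram, and the 50x discriminator loss weight is the λ from the paper.

```python
import torch
from transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining

# Toy-sized configs; in the paper the generator is smaller than the discriminator.
gen = ElectraForMaskedLM(ElectraConfig(
    vocab_size=100, embedding_size=16, hidden_size=16,
    num_hidden_layers=1, num_attention_heads=2, intermediate_size=32))
disc = ElectraForPreTraining(ElectraConfig(
    vocab_size=100, embedding_size=16, hidden_size=32,
    num_hidden_layers=1, num_attention_heads=2, intermediate_size=64))

MASK_ID = 3
inputs = torch.randint(4, 100, (1, 8))               # original token ids
mask_pos = torch.zeros_like(inputs, dtype=torch.bool)
mask_pos[0, [1, 5]] = True                           # mask two positions
masked_inputs = inputs.masked_fill(mask_pos, MASK_ID)

# Generator (MLM) loss is computed only at masked positions (-100 = ignore).
mlm_labels = inputs.masked_fill(~mask_pos, -100)
gen_out = gen(masked_inputs, labels=mlm_labels)

# Sample replacement tokens from the generator's output distribution.
sampled = torch.distributions.Categorical(logits=gen_out.logits).sample()
generated = torch.where(mask_pos, sampled, inputs)

# Discriminator labels are derived from the ORIGINAL inputs, so a correctly
# predicted masked token is automatically labeled "real" (0).
is_replaced = (generated != inputs).float()
disc_out = disc(generated, labels=is_replaced)

loss = gen_out.loss + 50.0 * disc_out.loss           # train both models at once
```

Backpropagating `loss` updates both models in the same step; the sampling operation itself is not differentiated through.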

Example:
inputs: “I am a super hero . You are too !”
masked_inputs: “I [MASK] a super hero . You [MASK] too !”
generated: “I is a super hero . You are too !”
is_replaced: “F T F F F F F F F F”
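The worked example above can be checked directly (plain Python, whitespace tokenization):

```python
inputs    = "I am a super hero . You are too !".split()
generated = "I is a super hero . You are too !".split()

# "T" (fake) only where the sampled token differs from the original;
# "are" was masked but predicted correctly, so it stays "F" (real).
is_replaced = ["T" if g != o else "F" for g, o in zip(generated, inputs)]
print(" ".join(is_replaced))  # F T F F F F F F F F
```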

@RichardWang

I think I am starting to see where I might be misunderstanding. I'll outline what I've been assuming so that any misunderstandings can be cleared up.

I have been assuming that ElectraForPreTraining handles training both the generator and the discriminator, since my understanding from the paper was that they are trained concurrently. As a result, I was confused as to how ElectraForPreTraining could possibly train the generator without labels for the MLM. However, after reading your post, I am starting to think that this assumption is incorrect. This leads me to the following questions:

  1. Does ElectraForPreTraining only train the discriminator?

  2. Am I supposed to first train an ElectraForMaskedLM to train the generator and then load it into ElectraForPreTraining in order to train the discriminator?

  3. Is there any way to train the generator and discriminator at the same time using huggingface? As I mentioned before, it is my understanding from the paper that they are both supposed to be trained at the same time.

@rsvarma

  1. Yes.
  2. No, train the two models at the same time.
  3. Hmm, you are looking for an ELECTRA training script. There are some implementations that use huggingface:
  • simpletransformers
    Although I didn't read all the code, I found some differences from the paper (e.g., it doesn't also tie the LayerNorm and dropout of the embedding layers of the generator and discriminator).
  • The ELECTRA trainer from huggingface/transformers itself
    But it is still a PR that has been stuck on some problems for months, and it uses a different sampling method from the paper's.
  • My implementation
    I personally believe it is the closest to the official implementation, but I am doing the final tests and haven't cleaned my code or uploaded docs yet.

The first two didn't validate their code by replicating the paper's results. As for my implementation, I am doing that, but the release might take at least a week or longer.
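On the weight-tying detail mentioned above: here is a minimal sketch of sharing the embedding module between the two models, assuming the `transformers` classes (toy config sizes; the two configs must use the same embedding_size for the modules to be interchangeable).

```python
from transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining

EMB = 16  # generator and discriminator must share embedding_size to tie weights
gen = ElectraForMaskedLM(ElectraConfig(
    vocab_size=100, embedding_size=EMB, hidden_size=16,
    num_hidden_layers=1, num_attention_heads=2, intermediate_size=32))
disc = ElectraForPreTraining(ElectraConfig(
    vocab_size=100, embedding_size=EMB, hidden_size=32,
    num_hidden_layers=1, num_attention_heads=2, intermediate_size=64))

# Share the entire embedding module (word + position embeddings, LayerNorm,
# dropout) by pointing both models at the same object.
disc.electra.embeddings = gen.electra.embeddings

# Both models now read and update the same embedding weights.
assert (disc.electra.embeddings.word_embeddings.weight.data_ptr()
        == gen.electra.embeddings.word_embeddings.weight.data_ptr())
```

Because embedding_size differs from each model's hidden_size here, each ElectraModel projects the shared embeddings up to its own hidden width, which is how the paper ties embeddings between differently sized generator and discriminator.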

Hi @rsvarma !

FYI, my ELECTRA reimplementation has successfully replicated the results in the paper.

:computer:Code: https://github.com/richarddwang/electra_pytorch
:newspaper:Post: ELECTRA training reimplementation and discussion


Hi Richard,

Excellent! I checked out your code and used bits of it to get my pretraining done. Thanks for your help!
