ELECTRA: Accounting for mask tokens that are correctly predicted by MLM

I have been pretraining ELECTRA models using the ElectraForPreTraining class, some of them taking input_ids and others taking embeddings directly via the inputs_embeds parameter (I am training the model on biological data and assign particular tokens to particular species).

As I understand from the paper, both the generator and the discriminator are trained during pretraining: the generator is a small masked language model, and the discriminator predicts whether each token is the original or a token replaced by the generator. The paper also states that "if the generator happens to generate the correct token, that token is considered 'real' instead of 'fake'." However, the labels ElectraForPreTraining uses as the ground truth for "real" vs. "fake" are an array passed in by me, and I have no way of knowing in advance whether the generator will predict a given token correctly, so the labels I pass in mark every masked token as "fake" whether or not the generator actually succeeds. Moreover, since the model receives no information about the true identity of the masked tokens, it seems to me that the model has no way to handle this case. Is this being properly handled in the model? If so, what do I have to do to make sure I am passing in my data correctly?

inputs
| mask
masked inputs
| generator (ElectraForMaskedLM)
gen logits
| gumbel softmax
generated
| discriminator (ElectraForPreTraining)
disc logits

is_replaced = generated == inputs
edited: is_replaced = generated != inputs

is_replaced is the ground-truth label for the discriminator.
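A tiny sketch of that label computation (PyTorch assumed; the token ids are toy values): because is_replaced compares generated against the original inputs, a masked position the generator happens to predict correctly ends up labeled real.

```python
import torch

inputs = torch.tensor([5, 17, 9, 42])                # original token ids
mask_pos = torch.tensor([False, True, False, True])  # positions that were masked
generated = inputs.clone()
generated[mask_pos] = torch.tensor([17, 8])          # correct at pos 1, wrong at pos 3

is_replaced = (generated != inputs).long()           # 1 = fake, 0 = real
print(is_replaced.tolist())  # [0, 0, 0, 1]
```

Note that position 1 was masked, yet its label is 0 (real) because the generator reproduced the original token.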

@RichardWang

Thanks for responding. I understand that is_replaced = generated != inputs would give the true labels, but my issue is that when an input is a masked token, generated will never equal the masked input; however, if the generated token matches the value that position had before masking, then that token should not be considered replaced, which seems not to be the case under the setup you described. I am confused as to whether this is dealt with.

@rsvarma
I am not sure I got your point. To avoid ambiguity (e.g., which "input"? the input of the generator or of the discriminator?), you could use exactly the terms I typed:

inputs
| mask
masked inputs
| generator (ElectraForMaskedLM)
gen logits
| gumbel softmax
generated
| discriminator (ElectraForPreTraining)
disc logits

But let me list a few points so that we can confirm whether our understandings are the same.

  • You yourself initialize a model instance of ElectraForMaskedLM as the generator and an instance of ElectraForPreTraining as the discriminator.
  • The input of the generator is masked_inputs, which you create by masking inputs.
  • The input of the discriminator is generated, which you obtain by feeding masked_inputs to the generator and then sampling from the generator's output.
  • generated can only differ from inputs at masked positions.
  • At some masked positions, generated may still equal the original inputs; it depends on whether the generator predicts correctly at those positions.
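The pipeline in the points above can be sketched as a single training step. This is a rough sketch, assuming the HuggingFace `transformers` library: the tiny config values are illustrative, plain categorical sampling stands in for the Gumbel-softmax step in the diagram, and the 50x discriminator loss weight is the λ from the paper.

```python
import torch
from transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining

# Toy-sized configs; in the paper the generator is smaller than the discriminator.
gen = ElectraForMaskedLM(ElectraConfig(
    vocab_size=100, embedding_size=16, hidden_size=16,
    num_hidden_layers=1, num_attention_heads=2, intermediate_size=32))
disc = ElectraForPreTraining(ElectraConfig(
    vocab_size=100, embedding_size=16, hidden_size=32,
    num_hidden_layers=1, num_attention_heads=2, intermediate_size=64))

MASK_ID = 3
inputs = torch.randint(4, 100, (1, 8))               # original token ids
mask_pos = torch.zeros_like(inputs, dtype=torch.bool)
mask_pos[0, [1, 5]] = True                           # mask two positions
masked_inputs = inputs.masked_fill(mask_pos, MASK_ID)

# Generator (MLM) loss is computed only at masked positions (-100 = ignore).
mlm_labels = inputs.masked_fill(~mask_pos, -100)
gen_out = gen(masked_inputs, labels=mlm_labels)

# Sample replacement tokens from the generator's output distribution.
sampled = torch.distributions.Categorical(logits=gen_out.logits).sample()
generated = torch.where(mask_pos, sampled, inputs)

# Discriminator labels are derived from the ORIGINAL inputs, so a correctly
# predicted masked token is automatically labeled "real" (0).
is_replaced = (generated != inputs).float()
disc_out = disc(generated, labels=is_replaced)

loss = gen_out.loss + 50.0 * disc_out.loss           # train both models at once
```

Backpropagating `loss` updates both models in the same step; the sampling operation itself is not differentiated through.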

Example:
inputs: “I am a super hero . You are too !”
masked_inputs: “I [MASK] a super hero . You [MASK] too !”
generated: “I is a super hero . You are too !”
is_replaced: “F T F F F F F F F F”
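The worked example above can be checked directly (plain Python, whitespace tokenization):

```python
inputs    = "I am a super hero . You are too !".split()
generated = "I is a super hero . You are too !".split()

# "T" (fake) only where the sampled token differs from the original;
# "are" was masked but predicted correctly, so it stays "F" (real).
is_replaced = ["T" if g != o else "F" for g, o in zip(generated, inputs)]
print(" ".join(is_replaced))  # F T F F F F F F F F
```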

@RichardWang

I think I am starting to see where I might be misunderstanding. I'll outline what I've been assuming so that any misunderstandings can be cleared up.

I have been assuming that ElectraForPreTraining handles training both the generator and the discriminator, since my understanding from the paper was that they are trained concurrently. As a result, I was confused as to how ElectraForPreTraining could possibly train the generator without labels for the MLM. However, after reading your post, I am starting to think that this assumption is incorrect. This leads me to the following questions:

  1. Does ElectraForPreTraining only train the discriminator?

  2. Am I supposed to first train an ElectraForMaskedLM to train the generator and then load it into ElectraForPreTraining in order to train the discriminator?

  3. Is there any way to train the generator and discriminator at the same time using huggingface? As I mentioned before, it is my understanding from the paper that they are both supposed to be trained at the same time.

@rsvarma

  1. Yes.
  2. No, train the two models at the same time.
  3. Hmm, you are looking for an ELECTRA training script. There are some implementations that use huggingface:
  • simpletransformers
    Although I didn't read all the code, I found some differences from the paper (e.g., it doesn't also tie the LayerNorm and dropout of the embedding layers of the generator and discriminator).
  • The ELECTRA trainer from huggingface/transformers itself
    But it is still a PR that has been stuck on some problems for months, and it uses a different sampling method from the paper's.
  • My implementation
    I personally believe it is the closest to the official implementation, but I am doing the final tests and haven't cleaned my code or uploaded docs yet.

The first two didn't validate their code by replicating the paper's results. As for my implementation, I am doing that, but the release might take at least a week or longer.
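On the weight-tying detail mentioned above: here is a minimal sketch of sharing the embedding module between the two models, assuming the `transformers` classes (toy config sizes; the two configs must use the same embedding_size for the modules to be interchangeable).

```python
from transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining

EMB = 16  # generator and discriminator must share embedding_size to tie weights
gen = ElectraForMaskedLM(ElectraConfig(
    vocab_size=100, embedding_size=EMB, hidden_size=16,
    num_hidden_layers=1, num_attention_heads=2, intermediate_size=32))
disc = ElectraForPreTraining(ElectraConfig(
    vocab_size=100, embedding_size=EMB, hidden_size=32,
    num_hidden_layers=1, num_attention_heads=2, intermediate_size=64))

# Share the entire embedding module (word + position embeddings, LayerNorm,
# dropout) by pointing both models at the same object.
disc.electra.embeddings = gen.electra.embeddings

# Both models now read and update the same embedding weights.
assert (disc.electra.embeddings.word_embeddings.weight.data_ptr()
        == gen.electra.embeddings.word_embeddings.weight.data_ptr())
```

Because embedding_size differs from each model's hidden_size here, each ElectraModel projects the shared embeddings up to its own hidden width, which is how the paper ties embeddings between differently sized generator and discriminator.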

Hi @rsvarma !

FYI, my ELECTRA reimplementation has successfully replicated the results in the paper.

:computer:Code: https://github.com/richarddwang/electra_pytorch
:newspaper:Post: ELECTRA training reimplementation and discussion


Hi Richard,

Excellent! I checked out your code and used bits of it to get my pretraining done. Thanks for your help!
