Hi, I'm Srinjoy, a master's student studying NLP. I was reading the ELECTRA paper by Clark et al., and while going through the implementation details a few questions came up. I was wondering if you could help me with them.
- What exactly does "step" mean in the step count? Does one step correspond to one epoch or to one minibatch (i.e., one gradient update)?
- In Table 1 of the paper, ELECTRA-Small and BERT-Small are both listed with 14M parameters. How is that possible? ELECTRA should have more parameters, since it contains both a generator and a discriminator, each of which is BERT-based.
- What are the architectures of the generator and the discriminator? Are they both BERT, or something else?
- There is a sampling step between the generator and the discriminator. Since sampling a discrete token is not differentiable, how are gradients back-propagated through it?
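To make the last question concrete, here is a minimal plain-Python sketch of the sampling step as I understand it (the vocabulary, logits, and variable names are made up purely for illustration, not taken from the paper's code):

```python
import math
import random

random.seed(0)

# Toy generator output: logits over a tiny 5-token vocabulary
# (made-up numbers, just for illustration).
vocab = ["the", "cat", "sat", "mat", "dog"]
logits = [2.0, 0.5, -1.0, 0.1, -3.0]

# Softmax turns the logits into the generator's output distribution.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# The sampling step: a *discrete* token id is drawn from the distribution.
token_id = random.choices(range(len(vocab)), weights=probs, k=1)[0]

# The discriminator only ever sees this integer index (via its own
# embedding lookup), not the continuous probabilities -- which is why
# I don't see a differentiable path from the discriminator's loss back
# to the generator's logits.
print(token_id, vocab[token_id])
```

This is just to show where my confusion lies: the arrow from `probs` to `token_id` discards the continuous values, so I don't understand what gradient signal, if any, reaches the generator through that arrow.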
Thanks in advance!