ELECTRA Paper Doubts

Hello Everyone,
I am Srinjoy, a master’s student currently studying NLP. I was reading the ELECTRA paper by Clark et al., and while going through the implementation details I ran into a few doubts.

I was wondering if you could help me with those.

  1. What exactly does “step” mean in the step count? Is one step one epoch or one minibatch?
    1. Also, in Table 1 of the paper, ELECTRA-Small and BERT-Small both have 14M parameters. How is that possible? Shouldn’t ELECTRA have more parameters, since its generator and discriminator modules are both BERT-based?
  2. Also, what are the architectures of the generator and the discriminator? Are they both BERT, or something else?
  3. Also, there is a sampling step between the generator and the discriminator. Since sampling is a discrete operation, how are gradients back-propagated through it?
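
To make question 3 concrete, here is a minimal NumPy sketch of the step I mean (the names and numbers are my own, not from the paper): the generator produces a softmax distribution over the vocabulary for a masked position, a discrete token id is sampled from it, and that integer id is what the discriminator receives.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10

# Hypothetical generator logits for one masked position (made-up numbers).
logits = rng.normal(size=vocab_size)
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary

# The sampling step: a discrete draw from the generator's distribution.
sampled_id = int(rng.choice(vocab_size, p=probs))

# The discriminator only ever sees the integer `sampled_id`, not `probs`,
# so the derivative of `sampled_id` with respect to `logits` is zero
# almost everywhere -- which is why I don't see how gradients could flow
# back into the generator through this step.
print(sampled_id)
```

My confusion is whether the gradient is somehow passed through this draw (e.g. with a relaxation), or whether the generator is trained by some other signal entirely.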

Thanks in advance