I am fine-tuning and pre-training BERT and ELECTRA, but I really don't understand how those two models can be trained the same way.
Take BERT first.
BERT is pre-trained with masked language modeling (MLM) and next sentence prediction, so logically, if I keep training BERT with MLM on my own data, its embeddings should become better suited to that data than before.
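To make this concrete, here is a minimal sketch of what I mean by an MLM training step for BERT (only the MLM part, not next sentence prediction). The tiny config sizes, the `MASK_ID` value, and the random token ids are my own placeholders so that nothing has to be downloaded; they are not real pre-trained values.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny, randomly initialised model (placeholder sizes, not a real checkpoint).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertForMaskedLM(config)

MASK_ID = 4  # assumed id of the [MASK] token in this toy vocabulary

input_ids = torch.randint(5, 100, (2, 8))    # fake batch: 2 sequences of 8 tokens
labels = torch.full_like(input_ids, -100)    # -100 means "ignore in the loss"
labels[:, 3] = input_ids[:, 3]               # only position 3 is predicted

masked = input_ids.clone()
masked[:, 3] = MASK_ID                       # hide the original token

out = model(input_ids=masked,
            attention_mask=torch.ones_like(masked),
            labels=labels)
mlm_loss = out.loss                          # cross-entropy over the vocabulary
mlm_logits_shape = tuple(out.logits.shape)   # (batch, seq_len, vocab_size)
```

With real data, I assume the masking would normally be done by a data collator rather than by hand as above.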
The problem is ELECTRA.
The transformers documentation suggests the same fine-tuning steps as for BERT.
ELECTRA's discriminator is not trained with MLM. Instead, it predicts for each token whether it is fake (replaced) or real (original), similar to a GAN.
So logically, if I fine-tune ELECTRA, I guess I have to train it to tell which tokens are fake and which are real.
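This is roughly the kind of training step I would expect for ELECTRA, sketched with `ElectraForPreTraining` (the discriminator head in transformers). As far as I understand, the replaced tokens would normally come from a generator model; here I simply overwrite one position by hand as a stand-in, and the config sizes and token ids are placeholders of my own.

```python
import torch
from transformers import ElectraConfig, ElectraForPreTraining

# Tiny, randomly initialised discriminator (placeholder sizes, not a real checkpoint).
config = ElectraConfig(vocab_size=100, embedding_size=16, hidden_size=32,
                       num_hidden_layers=2, num_attention_heads=2,
                       intermediate_size=64)
discriminator = ElectraForPreTraining(config)

original = torch.randint(5, 100, (2, 8))     # fake batch: 2 sequences of 8 tokens
corrupted = original.clone()
# Stand-in for the generator: shift the token at position 3 to a different id.
corrupted[:, 3] = (original[:, 3] - 5 + 1) % 95 + 5

# Per-token labels: 1 = replaced (fake), 0 = original (real).
labels = (corrupted != original).long()

out = discriminator(input_ids=corrupted,
                    attention_mask=torch.ones_like(corrupted),
                    labels=labels)
rtd_loss = out.loss                          # binary loss over every token
rtd_logits_shape = tuple(out.logits.shape)   # (batch, seq_len): one score per token
```

Note that the labels here are one 0/1 value per token, unlike the vocabulary-sized targets of MLM.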
But the transformers documentation suggests MLM, and regular fine-tuning looks the same as for BERT too.
model(input_ids, attention_mask, token_type_ids, …etc)
These are the parameters that the ELECTRA model takes,
but I think these kinds of parameters should work for BERT, not for ELECTRA.
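For reference, this sketch shows the identical call I mean, using the sequence classification heads of both models. The small config sizes, random inputs, and labels are placeholders of my own; the point is only that both models accept the same `input_ids`, `attention_mask`, and `token_type_ids` arguments.

```python
import torch
from transformers import (BertConfig, BertForSequenceClassification,
                          ElectraConfig, ElectraForSequenceClassification)

# Shared placeholder sizes for two tiny, randomly initialised models.
sizes = dict(vocab_size=100, hidden_size=32, num_hidden_layers=2,
             num_attention_heads=2, intermediate_size=64, num_labels=2)
bert = BertForSequenceClassification(BertConfig(**sizes))
electra = ElectraForSequenceClassification(ElectraConfig(embedding_size=16, **sizes))

# The same batch is fed to both models.
batch = {
    "input_ids": torch.randint(5, 100, (2, 8)),
    "attention_mask": torch.ones(2, 8, dtype=torch.long),
    "token_type_ids": torch.zeros(2, 8, dtype=torch.long),
}
labels = torch.tensor([0, 1])                # one class label per sequence

shapes = {}
for name, model in [("bert", bert), ("electra", electra)]:
    out = model(**batch, labels=labels)
    shapes[name] = tuple(out.logits.shape)   # (batch, num_labels) for both
```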
Can anyone please help me?