Hi
I have modified a BERT model a bit by adding small `Linear` layers between its layers; the only random part is the random initialization done for these layers, as below:
W = torch.nn.init.xavier_normal_(tensor, gain=math.sqrt(2))  # tensor is the new layer's weight
I put these initializations where each layer is defined. I am getting a 3-4% difference in results from run to run, and would really appreciate your help fixing this issue.
Could you please advise on how I should handle initialization on top of a BERT model? Should it all go inside `_init_weights()`? Does it make a difference whether this is done inside that function or elsewhere in the model?
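For context, here is a minimal sketch of the two options I am comparing (the class name `MyBertWithExtraLinears` and the `extra_linears` attribute are just placeholders for my actual model):

```python
import math

import torch
from transformers import BertModel, BertPreTrainedModel


class MyBertWithExtraLinears(BertPreTrainedModel):  # placeholder name, not my real class
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        # the small extra Linear layers added between BERT's layers
        self.extra_linears = torch.nn.ModuleList(
            [torch.nn.Linear(config.hidden_size, config.hidden_size)
             for _ in range(config.num_hidden_layers)]
        )
        # Option A (what I currently do): initialize right here, at definition time
        # for layer in self.extra_linears:
        #     torch.nn.init.xavier_normal_(layer.weight, gain=math.sqrt(2))
        self.init_weights()  # runs _init_weights on every submodule

    def _init_weights(self, module):
        # Option B: route the custom init through the standard hook instead
        if any(module is layer for layer in self.extra_linears):
            torch.nn.init.xavier_normal_(module.weight, gain=math.sqrt(2))
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        else:
            super()._init_weights(module)
```

Is Option B the recommended pattern, or is Option A equivalent as long as the seed is fixed?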
Hugging Face's run_glue.py fixes the random seeds once at the top, before everything else runs; should I re-set the seed each time right before the initialization?
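To make it concrete, this is what I mean (just a sketch; the seed value 42 is arbitrary):

```python
from transformers import set_seed

set_seed(42)  # what run_glue.py does once, early in main()
model = MyBertWithExtraLinears.from_pretrained("bert-base-uncased")

# ...or do I need to call set_seed() again right before the
# randomly initialized extra layers are created?
```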
I am really struggling with this issue and would greatly appreciate your help. @sgugger @stas
Hi
I can confirm the same issue also happens for the BERT model without any modifications. For this, I ran it on MRPC for 3 epochs; here are the two results: