Hi there - we are using this BERT architecture from Google:
* attention_probs_dropout_prob: 0.1
* hidden_act: "gelu"
* hidden_dropout_prob: 0.1
* hidden_size: 768
* initializer_range: 0.02
* intermediate_size: 3072
* max_position_embeddings: 512
* num_attention_heads: 12
* num_hidden_layers: 12
* type_vocab_size: 2
* vocab_size: 32000
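For reference, the same architecture expressed as a Hugging Face `BertConfig` would look roughly like this (just a sketch on our side; we assume the stock `transformers.BertConfig`, whose field names mirror the JSON above one-to-one):

```python
from transformers import BertConfig

# The bert_config.json above, spelled out explicitly.
# (Sketch only - in practice the conversion script reads the JSON file directly.)
config = BertConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
)
```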
We trained it from scratch on 10 million documents from our very specific domain and also swapped out the optimizer and the sentence tokenizer. Our BERT now works wonderfully: we evaluate it on masked-token and next-sentence prediction, and we can also fine-tune it for downstream tasks such as classification - all of that works, as long as we stay in the TensorFlow/Nvidia world.
We would, however, love to open up all the possibilities of PyTorch as well, so we applied the following script: "convert_bert_original_tf_checkpoint_to_pytorch.py".
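In case the details matter: as far as we understand it, the script essentially boils down to the following (a sketch with placeholder paths; the exact function signatures depend on the transformers version):

```python
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Placeholder paths - our real checkpoint prefix, config and output file differ.
config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)

# Copies the TF checkpoint variables into the freshly initialized PyTorch model;
# this is where unrecognized variable names (e.g. our custom optimizer slots) show up.
load_tf_weights_in_bert(model, config, "model.ckpt")

torch.save(model.state_dict(), "pytorch_model.bin")
```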
This script initially fails because our optimizer variables are not recognized. After adding them to the list of skipped attributes,
if any(n in ["adam_v", "adam_m", "AdamWeightDecayOptimizer", "AdamWeightDecayOptimizer_1", "global_step"] for n in name):
the script runs through fine and we get a working torch model. And here things become very strange. Initially we tried to fine-tune the torch version of our BERT and nothing happened - inspecting the attention values and the outputs for the input tokens, we realized that for any given input sequence ALL attention values are identical (they do change when the input changes). Likewise, the output values in the hidden states of the converted base model are identical for each token, irrespective of the input.
Nevertheless, the conversion appears to succeed in the sense that the result can be loaded and is a fully valid PyTorch model file.
I understand that all of this is quite vague, but maybe it sounds familiar to somebody, or someone can give us a hint on where to look next.
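One concrete check we are considering (a sketch below, assuming the standard tensorflow and transformers APIs; the paths and the inspected variable name are placeholders) is to compare a single weight tensor from the original checkpoint with its counterpart in the converted model, to see whether the encoder weights were actually copied or silently left at their random initialization:

```python
import numpy as np
import tensorflow as tf
from transformers import BertModel

CKPT = "model.ckpt"                 # placeholder: prefix of the original TF checkpoint
PT_DIR = "converted_pytorch_model"  # placeholder: directory with the converted model

# Which variables does the original checkpoint actually contain?
tf_names = [name for name, _ in tf.train.list_variables(CKPT)]
print([n for n in tf_names if "layer_0/attention/self/query" in n])

# Does the converted model carry the same numbers?
# (TF stores dense kernels as [in, out]; the conversion transposes them.)
tf_query = tf.train.load_variable(CKPT, "bert/encoder/layer_0/attention/self/query/kernel")
pt_query = BertModel.from_pretrained(PT_DIR).encoder.layer[0].attention.self.query.weight
print(np.allclose(tf_query.T, pt_query.detach().numpy(), atol=1e-5))
```

If the two tensors do not match, the uniform attention pattern below would at least be consistent with the query/key weights never having been loaded.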
Here is how the effect manifests:
outputs = torch_bert_model(**example_input)
print(outputs[0])  # last hidden state - irrespective of the input, it is always this:
tensor([[[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
...,
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04],
[ 1.5048e-04, -4.1375e-06, 1.4493e-04, ..., 2.1820e-04,
-6.6411e-05, 2.1333e-04]]], grad_fn=<NativeLayerNormBackward>)
And the attention heads:
(In the input ids below, 4 = [CLS] and 5 = [SEP].)
The attention weights differ from one input to the next, but for a given input every head holds the identical value at every position - here 0.0909 ≈ 1/11, i.e. uniform attention over the 11 input tokens.
tensor([[ 4, 13, 8, 6060, 5, 13, 2840, 350, 8, 6060, 5]])
(tensor([[[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
...,
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]],
[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
...,
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909]]]],
grad_fn=<SoftmaxBackward>), tensor([[[[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
[0.0909, 0.0909, 0.0909, ..., 0.0909, 0.0909, 0.0909],
Thanks a lot for any help! Much appreciated.