Each of BERT's hidden states is the output of a LayerNorm layer. Since LayerNorm normalizes each token's activations to zero mean and unit variance, I expected the standard deviation of each hidden state (taken over the hidden dimension) to be near 1. However, when I inspect the hidden states, the standard deviations are much smaller. Why would this be the case?
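For reference, here is a sanity check of what I expected, using a plain `torch.nn.LayerNorm` on its own (with its affine weight at the default initialization of 1 and bias 0), independent of BERT:

```python
import torch

# A freshly initialized LayerNorm (weight = 1, bias = 0) should produce
# outputs with per-token std close to 1, regardless of the input scale.
torch.manual_seed(0)
ln = torch.nn.LayerNorm(768)
x = torch.randn(1, 8, 768) * 5 + 3  # arbitrary scale and shift
y = ln(x)
print(y.std(dim=-1))  # every value is close to 1
```

This is the behavior I assumed would carry over to BERT's hidden states.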
# assuming a bert-base checkpoint; the original snippet does not name the model
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
for hidden_state in outputs['hidden_states']:
    print(f'Hidden State {hidden_state}')
    print(f'Hidden state std: {hidden_state.std(dim=-1)}')
    print()
Hidden State tensor([[[ 0.1664, -0.0541, -0.0014, …, -0.0811, 0.0794, 0.0155],
[-0.4229, 0.1071, -0.3010, …, 0.0352, -0.3372, 0.2603],
[ 0.5254, 0.1029, -0.0767, …, -0.6114, -0.2440, 0.2591],
…,
[ 0.2794, 0.0381, -0.0276, …, 0.1147, -0.0178, -0.0976],
[ 0.0204, 0.4912, 0.1750, …, 0.4872, -0.2833, -0.0511],
[ 0.1736, -0.1560, 0.0525, …, 0.3813, 0.1285, 0.1339]]],
grad_fn=<NativeLayerNormBackward0>)
Hidden state std: tensor([[0.2599, 0.3309, 0.2946, 0.3154, 0.3333, 0.3045, 0.3281, 0.2870]],
grad_fn=<StdBackward0>)
…
The following Colab notebook reproduces the results: BERT hidden state standard deviation - Colaboratory (google.com).