The LM head is the language modelling head. The output of the transformer is a tensor of shape (batch_size, max_target_len, model_dimension). In the final step, where you convert these transformer outputs to words, you first project them linearly to the vocabulary size and then apply a softmax, which gives you, for each position i in the target sequence, the probability of that position being each word in the vocabulary. The layer where all of this happens is the LM head.
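A minimal sketch of that last step, with made-up sizes (batch_size=2, max_target_len=5, model_dimension=512, vocab_size=1000) just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only
batch_size, max_target_len = 2, 5
model_dimension, vocab_size = 512, 1000

# Pretend transformer output: (batch_size, max_target_len, model_dimension)
hidden_states = rng.standard_normal((batch_size, max_target_len, model_dimension))

# The LM head: a linear projection from model_dimension to vocab_size
W = rng.standard_normal((model_dimension, vocab_size)) * 0.02
b = np.zeros(vocab_size)
logits = hidden_states @ W + b  # (batch_size, max_target_len, vocab_size)

# Softmax over the vocabulary axis: for each position i, a probability
# distribution over all words in the vocabulary
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)

print(probs.shape)  # (2, 5, 1000)
```

In real models (e.g. in `transformers`) the LM head is just an `nn.Linear(model_dimension, vocab_size)`, and the softmax is often folded into the cross-entropy loss during training rather than applied explicitly.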