In many Transformers fine-tuning tasks, the linear layer's variable name is ‘lm_head’.
What does that mean?
Linear model head?
Language model head?
In the case of Wav2Vec2ForCTC, lm_head is used, but that sounds weird to me.
Wav2Vec is not an NLP model…!
Is the name wrong?
It’s the language modeling head.
The LM head is the language modeling head. The output of the transformer is a tensor of size (batch_size, max_target_len, model_dimension). In the final step, where you convert these transformer outputs to words, you first project them linearly and then apply a softmax over the result, which gives, for each position (i) in the target sequence, the probability of that position being a certain word in the vocabulary. The layer where all of this happens is the LM head.
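The projection-plus-softmax described above can be sketched in a few lines. This is a minimal NumPy illustration with made-up toy sizes, not the actual implementation from any library:

```python
import numpy as np

# Toy sizes (illustrative assumptions, not from any real model)
batch_size, max_target_len, model_dim, vocab_size = 2, 4, 8, 10

rng = np.random.default_rng(0)

# Stand-in for the transformer output: (batch_size, max_target_len, model_dim)
hidden = rng.standard_normal((batch_size, max_target_len, model_dim))

# The LM head: a linear projection from model_dim to vocab_size
W = rng.standard_normal((model_dim, vocab_size))
b = np.zeros(vocab_size)
logits = hidden @ W + b  # (batch_size, max_target_len, vocab_size)

# Softmax over the vocabulary axis: per-position word probabilities
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)

print(probs.shape)  # (2, 4, 10)
```

Each position in the sequence now carries a probability distribution over the whole vocabulary, and the distributions at every position sum to 1.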
A little bit confused about how pretraining works
Is the LM head also used during pretraining? If pretraining is just trying to predict the next token, then the conditional LM head would allow for this, right?
The head is not used during pre-training in my understanding, but only afterward during fine-tuning. Here is what ChatGPT says when asked “what is the ‘head’ of a Large Language Model?” (I checked this and I think it is a good explanation):
In the context of Large Language Models (LLMs) like GPT-3 or BERT, the term “head” refers to the additional layers or mechanisms added on top of the pre-trained base model to adapt it for specific tasks. These could range from classification layers for tasks like sentiment analysis to more complex architectures for tasks like machine translation or question answering.
Common Types of Heads:
Classification Head: For tasks like text classification, a fully connected (dense) layer is usually added to the output of the base model, followed by a softmax activation to produce class probabilities.
Regression Head: For regression tasks, a dense layer may be added without a softmax activation, designed to output a continuous value.
Token Classification Head: For named entity recognition or part-of-speech tagging, a token-level classifier is usually added to assign labels to each token in the input sequence.
Sequence-to-Sequence Head: For tasks like translation or summarization, a decoder mechanism may be added to generate a sequence of tokens as output.
Question-Answering Head: For QA tasks, the model might have two dense layers to predict the start and end positions of the answer span within the context text.
The specific architecture of the “head” would depend on the task it’s designed for. The idea is to fine-tune these additional layers on task-specific data to adapt the general language understanding capabilities of the LLM to the specific requirements of the task at hand.
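As a concrete illustration of the first head type above, here is a minimal classification-head sketch in NumPy. All names and sizes are illustrative assumptions; a real head would be a trained layer on top of a pre-trained base model, not random weights:

```python
import numpy as np

# Toy sizes (assumptions for illustration)
model_dim, num_classes = 8, 3

rng = np.random.default_rng(42)

# Stand-in for the pooled output of a pre-trained base model
pooled_output = rng.standard_normal(model_dim)

# Classification head: one dense (fully connected) layer...
W = rng.standard_normal((model_dim, num_classes))
b = np.zeros(num_classes)
logits = pooled_output @ W + b  # (num_classes,)

# ...followed by a softmax to produce class probabilities
exp = np.exp(logits - logits.max())
class_probs = exp / exp.sum()

predicted_class = int(np.argmax(class_probs))
print(class_probs.shape)  # (3,)
```

During fine-tuning, only this small layer (and optionally the base model) is trained on task-specific labels; the base model's representations do most of the work.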