Clarification on heads, layers, training, and output


I am using TensorFlow and I am doing multi-class sentence classification with XLM-R base (“jplu/tf-xlm-roberta-base”) on a custom dataset.

  1. It seems that HF provides a different head for each task (e.g., token classification, sequence classification), and these models differ only in the head attached on top of the base architecture. So, if I'm not using a fine-tuned checkpoint, I have to fine-tune that head as well. My first question is: can we swap that head out and attach a different one (say, a custom feed-forward network or an LSTM-based network) for the same task, while keeping the pre-trained base model? And can we see the details of the head that comes by default for a particular task (i.e., the randomly initialized head for sequence classification)?

  2. Can we access the outputs of the model's intermediate layers, especially the outputs just before the final head? Is there an API/method for that in Hugging Face? And can we prune layers of the model (say, of XLM-R Base)?

  3. When we fine-tune, is it global fine-tuning (update the base model's parameters as well as the head's) or just feature extraction (freeze the original parameters of the base model checkpoint and train only the head)? Can we also perform whichever of the two is not the default?

  4. Final question: Trainer.predict() gives me a NumPy array as output. I guess the values in it are logits? For a multi-class sentence classification (with 4 classes), do these logits mean that no softmax is applied at the final head? How can we interpret this output? Mine is below.

PredictionOutput(predictions=array([[ 1.2991945e+00,  2.4860173e-01,  5.5320925e-01, -1.6669977e+00],
       [ 4.3599471e-01, -5.0883066e-02,  3.4532386e-01, -5.2039641e-01],
       [ 8.9458901e-01,  1.2760645e+00,  3.1270528e-01, -1.6415002e+00],
       [ 9.0530002e-01,  1.0148852e+00,  2.6518843e-01, -1.4662132e+00],
       [ 4.5786294e-01, -5.0590429e-02,  2.0140493e-01, -4.1767478e-01],
       [ 5.7495612e-01,  4.7848277e-02,  1.5834071e-01, -6.1066955e-01],
       [ 4.6566209e-01, -2.0567745e-02,  2.1055032e-01, -4.6179143e-01],
       [ 5.0190979e-01, -4.8803892e-02,  1.9314916e-01, -4.8067909e-01],
       [ 5.7928652e-01,  6.7762680e-02,  2.0994107e-01, -6.0617983e-01],
       [ 5.3082645e-01, -2.2240670e-02,  3.7937027e-01, -6.5518349e-01],
       [ 5.8990896e-01, -5.2324682e-02,  3.2848221e-01, -6.4274567e-01],
       [ 5.4325098e-01, -2.4219263e-02,  2.2602598e-01, -6.0041779e-01],
       [ 4.9988240e-01, -1.4048552e-03,  3.3386120e-01, -5.7529330e-01],
       [ 4.2276594e-01, -5.3270590e-02,  1.9273782e-01, -3.8076913e-01],
       [ 4.8189813e-01, -5.7544138e-02,  2.1740533e-01, -4.4707236e-01],
       [ 5.3467524e-01, -8.4268771e-02,  3.8555554e-01, -6.2313312e-01],
       [ 3.8383359e-01, -8.2594566e-02,  1.8413506e-01, -3.2918590e-01],
       [ 1.0014045e+00,  2.6587926e-02,  1.0125093e+00, -1.6064054e+00],
       [ 5.6751728e-01, -2.3115154e-02,  2.0180833e-01, -5.6251198e-01],
       [ 5.4358459e-01, -4.8401270e-02,  2.9657021e-01, -5.8822620e-01]],
      dtype=float32), label_ids=array([3, 2, 1, 1, 2, 1, 2, 0, 2, 0, 0, 0, 2, 2, 1, 0, 1, 2, 0, 0]), metrics={'eval_loss': 1.224756399790446, 'eval_accuracy': 0.5, 'eval_precision': 0.7441176470588236, 'eval_recall': 0.5})
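For now I am turning these values into probabilities and class labels myself, assuming they really are raw logits (plain NumPy softmax over the first two rows of the array above):

```python
import numpy as np

# First two rows of the predictions array above
logits = np.array([[1.2991945, 0.24860173, 0.55320925, -1.6669977],
                   [0.43599471, -0.050883066, 0.34532386, -0.52039641]])

probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
preds = probs.argmax(axis=-1)
print(preds)               # [0 0] -> predicted class indices
print(probs.sum(axis=-1))  # each row sums to 1
```

Is this post-processing what I'm expected to do, or does the library offer it somewhere?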

Sorry if this is too long; I thought I'd ask these together since they are somewhat related.