For 1, you can look in the training tutorial where there is an example in PyTorch.
For 2, the head is initialized randomly because we are using a checkpoint of the base model. It would be pretrained if we used a checkpoint that has been fine-tuned for sequence classification, such as distilbert-base-uncased-finetuned-sst-2-english.
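To see this concretely, here is a minimal sketch that runs offline. It assumes `torch` and `transformers` are installed, and it uses a tiny, made-up DistilBERT config as a stand-in for a real base checkpoint (the sizes below are hypothetical, chosen only to keep it fast). A base model is saved to a temporary folder, then loaded twice as a sequence classification model with different seeds: the body comes from the checkpoint, while the head is freshly initialized each time.

```python
import tempfile

import torch
from transformers import (
    DistilBertConfig,
    DistilBertForSequenceClassification,
    DistilBertModel,
)

# Hypothetical tiny config, standing in for a real base checkpoint.
config = DistilBertConfig(
    vocab_size=100,
    dim=32,
    hidden_dim=64,
    n_layers=1,
    n_heads=2,
    max_position_embeddings=32,
)

with tempfile.TemporaryDirectory() as tmp:
    # Save a *base* model: this checkpoint contains no classification head.
    DistilBertModel(config).save_pretrained(tmp)

    # Load it twice as a classification model, with different random seeds.
    torch.manual_seed(0)
    model_a = DistilBertForSequenceClassification.from_pretrained(tmp, num_labels=2)
    torch.manual_seed(1)
    model_b = DistilBertForSequenceClassification.from_pretrained(tmp, num_labels=2)

# The body weights are read from the checkpoint, so they match.
body_same = torch.equal(
    model_a.distilbert.embeddings.word_embeddings.weight,
    model_b.distilbert.embeddings.word_embeddings.weight,
)
# The head weights are not in the checkpoint, so they are drawn at random.
head_same = torch.equal(model_a.classifier.weight, model_b.classifier.weight)

print("body identical:", body_same)
print("head identical:", head_same)
```

With real checkpoints the same thing happens: passing distilbert-base-uncased to `from_pretrained` should print a warning that some weights were newly initialized, whereas distilbert-base-uncased-finetuned-sst-2-english already contains a trained head.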