Chapter 4 - Custom NER Model

In the book we create the model with xlmr_model = XLMRobertaForTokenClassification.from_pretrained("xlm-roberta-base"), using the custom class built on top of RoBERTa, so its submodules are named RobertaAttention, RobertaEmbeddings, and so on. However, if I instead instantiate the library's own class via from transformers import XLMRobertaForTokenClassification, its submodules are named XLMRobertaAttention, XLMRobertaEmbeddings, etc. Both calls end with .from_pretrained("xlm-roberta-base"), i.e. they load weights from the same checkpoint. Shouldn't that raise an error or at least a warning?

The point is that the two models (XLM-R and RoBERTa) have exactly the same architecture; they differ only in how they were pretrained: XLM-R was pretrained on massively upscaled, multilingual (rather than monolingual) training data. The book itself states this:

Since XLM-R uses the same model architecture as RoBERTa, we’ll use RoBERTa as the base model, but augmented with settings specific to XLM-R.

as well as here. The code underlying the XLM-R model implementation also refers to this in several places (see here, for instance).

Therefore, whether the implementation is built on an underlying RobertaModel or an underlying XLMRobertaModel makes no difference. The only thing that matters is that, when you load the "xlm-roberta-base" checkpoint, you effectively load the XLM-RoBERTa (base) configuration and pretrained weights.
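One way to see why no error occurs: checkpoint loading matches weights to the model by parameter paths and tensor shapes, not by Python class names. Here is a minimal, self-contained sketch of that idea in plain Python (no transformers code involved; all class names and the fake "checkpoint" are purely illustrative):

```python
# Sketch: loading a checkpoint keys on attribute *paths*, not class names.
# All classes and the fake "checkpoint" below are illustrative stand-ins,
# not real transformers internals.

class RobertaEmbeddings:
    def __init__(self):
        self.weight = [0.0] * 4  # placeholder parameter


class XLMRobertaEmbeddings:
    def __init__(self):
        self.weight = [0.0] * 4  # identical structure, different class name


class RobertaModel:
    def __init__(self):
        self.embeddings = RobertaEmbeddings()


class XLMRobertaModel:
    def __init__(self):
        self.embeddings = XLMRobertaEmbeddings()


def load_state_dict(model, state_dict):
    # Only the path "embeddings.weight" and the value's shape matter here;
    # the class name of model.embeddings never enters the comparison.
    model.embeddings.weight = list(state_dict["embeddings.weight"])


# One "checkpoint" loaded into two differently named model classes:
checkpoint = {"embeddings.weight": [0.1, 0.2, 0.3, 0.4]}
roberta_based = RobertaModel()
xlmr_based = XLMRobertaModel()
load_state_dict(roberta_based, checkpoint)
load_state_dict(xlmr_based, checkpoint)
assert roberta_based.embeddings.weight == xlmr_based.embeddings.weight
```

Because both model classes expose the same parameter path with the same shape, the same checkpoint loads into either one without complaint, which is exactly what happens with the RoBERTa-based and XLM-R-based classes and the "xlm-roberta-base" weights.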
That’s the way I’ve explained it to myself!