The models in the HF library are focused on NLP, so they include extra language-related components such as an embeddings layer, positional and token type information, as well as other model-specific features.
If you want to build your own model using Transformer layers, then perhaps you should look at https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html.
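As a rough idea, here is a minimal sketch of using plain `torch.nn.Transformer` on its own, without any of the NLP extras mentioned above (the hyperparameters and tensor sizes are just illustrative):

```python
import torch
import torch.nn as nn

# A bare encoder-decoder Transformer: no embeddings, no positional or
# token type information, you would add those yourself.
model = nn.Transformer(
    d_model=512,            # size of each token representation
    nhead=8,                # number of attention heads per layer
    num_encoder_layers=6,   # depth of the encoder stack
    num_decoder_layers=6,   # depth of the decoder stack
)

# Dummy inputs with shape (sequence_length, batch_size, d_model)
src = torch.rand(10, 32, 512)
tgt = torch.rand(20, 32, 512)

out = model(src, tgt)
print(out.shape)  # torch.Size([20, 32, 512])
```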
Regarding the last question, resnet18/50… are pre-trained models of different sizes, see this post. With Transformers you can also change the size of the network by specifying a different number of layers or attention heads. For instance, the bert-base model is 12 layers deep with 12 attention heads per layer, while bert-large is 24 layers deep and uses 16 attention heads. That is also why the embeddings they produce differ in size, 768 vs 1024: the hidden size is a separate hyperparameter, but it is chosen so it splits evenly across the heads (64 dimensions per head in both cases). If you want to understand more I recommend this post. So I think the parallelism would be bert-base being a ResNet-18 and bert-large a ResNet-50.
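To make the size comparison concrete, here is a small sketch using the `transformers` library's `BertConfig`; the numbers below match the published bert-base / bert-large configurations, and the model is randomly initialised rather than pre-trained:

```python
from transformers import BertConfig, BertModel

# bert-base: 12 layers, 12 heads, hidden size 768
base_config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)

# bert-large: 24 layers, 16 heads, hidden size 1024
large_config = BertConfig(
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
)

# Each head works on hidden_size / num_attention_heads dimensions:
# 768 / 12 == 64 and 1024 / 16 == 64 per head in both cases.
base_model = BertModel(base_config)
print(sum(p.numel() for p in base_model.parameters()))  # roughly 110M parameters
```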
You could use the pre-trained language modelling models listed here for language-related tasks, just as you would use pre-trained ResNets for computer vision tasks, although there are also more task-specific trained models you can explore here.
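Loading one of those pre-trained models looks something like this (a minimal sketch; "bert-base-uncased" is just one example checkpoint from the hub):

```python
from transformers import AutoTokenizer, AutoModel

# Analogous to loading a pre-trained ResNet from torchvision
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, NLP world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```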
As you can see there's a lot happening, so welcome to the NLP world!