I am totally new to NLP, transformers, and attention.
I was playing with sentence-transformers models and want to explore more, but now I'm stuck.
I have an input of shape BxKx768, which are my embedded features.
Is there a way to feed them into a transformer (which has an attention mechanism) and get an output of size BxM,
where M can be any number?
I learned how to do this when the input is a sentence, but I have no idea how to do it when the input is features.
I guess what I am asking is how to give my input to a transformer model and get my output.
Apologies in advance if it is a bad question.
If that is easy: are there different models to try? For example, with ResNet we have ResNet-18, ResNet-50, etc. Do we have the same thing here?
The models in the HF library are focused on NLP, hence they have extra stuff related to language, such as an embedding layer, positional and token-type information, as well as other model-specific features.
If you want to build your own model using Transformer layers, then perhaps you should look at https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html.
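A minimal sketch of what that could look like in PyTorch, assuming your BxKx768 features are already embedded and you just want a BxM output. The class name, the mean pooling, and the linear head producing M are my own choices for illustration, not anything prescribed:

```python
import torch
import torch.nn as nn

class FeatureTransformer(nn.Module):
    """Maps pre-embedded features of shape (B, K, 768) to (B, M)."""
    def __init__(self, d_model=768, nhead=8, num_layers=4, m_out=256):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, m_out)  # projects pooled features to size M

    def forward(self, x):           # x: (B, K, 768)
        h = self.encoder(x)         # (B, K, 768) - self-attention over the K features
        pooled = h.mean(dim=1)      # (B, 768)    - simple mean pooling over K
        return self.head(pooled)    # (B, M)

model = FeatureTransformer(m_out=128)
features = torch.randn(2, 10, 768)  # B=2, K=10
out = model(features)               # shape (2, 128)
```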
Regarding the last question, ResNet-18/50… are pre-trained models with different sizes, see this post. With Transformers you can also change the size of the network by specifying a different number of attention heads or layers. For instance, the bert-base model is 12 layers deep with 12 attention heads per layer, while bert-large is 24 layers deep and uses 16 attention heads. That's why the embeddings they produce differ in size, 768 vs 1024: the hidden size is the number of heads times the per-head dimension (64 here), so 12×64 = 768 and 16×64 = 1024. If you want to understand more I recommend this post. So I think the parallel would be bert-base being a ResNet-18 and bert-large a ResNet-50.
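For reference, here is how those two sizes look if you spell them out with the Hugging Face `BertConfig`. The field names come from the transformers library; the numbers are the published bert-base/bert-large hyperparameters:

```python
from transformers import BertConfig

# bert-base: 12 layers, 12 attention heads, hidden size 768 (= 12 heads * 64 dims)
base_config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)

# bert-large: 24 layers, 16 attention heads, hidden size 1024 (= 16 heads * 64 dims)
large_config = BertConfig(
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
)
```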
You could use the pre-trained language modelling models listed here for language-related tasks, just as you would use pre-trained ResNets for computer vision tasks, although there are also more task-specific trained models you can explore here.
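Getting embeddings out of one of those pre-trained checkpoints is just a few lines. This uses bert-base-uncased as an example, but any checkpoint from the hub works the same way:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("A sentence to embed.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```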
As you can see, there is a lot happening. Welcome to the NLP world!