Hi, I am more of a CV guy and recently got interested in doing an NLP project.
In this project, one part might involve extracting sentence-level semantic representations from a pretrained model.
In computer vision, one standard way to extract features from an image or a video snippet is to use
a ResNet pretrained on ImageNet or an I3D pretrained on Kinetics, respectively.
I want to do something similar in the NLP domain. Are there any recommended models pretrained on a specific dataset that I could try?
As far as my limited understanding goes, models trained on datasets that ask whether two sentences are semantically equivalent could be a direction (e.g. QQP, STS-B). But those need a pair of sentences, whereas in my case I feed just one sentence (or one block of sentences), not a pair. Any suggestions? Thanks!
Hi! IMO, BERT could be comparable to ResNet as the baseline (you can pool the last_hidden_state output of BertModel over its tokens, e.g. by averaging, and use the result much like the global-pooled features of ResNet). Newer models like RoBERTa and many more would then be comparable to EfficientNet etc.
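For what it's worth, here is a minimal sketch of the BERT-as-feature-extractor idea, assuming the `transformers` and `torch` packages are installed. Note that `last_hidden_state` holds one vector per token, so it has to be pooled into a single vector per sentence. Masked mean pooling is used here as one common choice. The checkpoint name `bert-base-uncased` and the helper names `masked_mean` and `embed_with_bert` are my own picks for illustration.

```python
def masked_mean(token_vectors, mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped).

    Written in plain Python so the arithmetic is easy to follow.
    Assumes at least one mask entry is 1, which holds for tokenizer output.
    """
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_with_bert(sentences, model_name="bert-base-uncased"):
    """Return one pooled vector (a list of floats) per input sentence."""
    # Imported here so masked_mean can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Shape: (batch, seq_len, hidden_size), i.e. one vector per token.
        hidden = model(**batch).last_hidden_state

    return [
        masked_mean(h.tolist(), m.tolist())
        for h, m in zip(hidden, batch["attention_mask"])
    ]
```

Calling `embed_with_bert(["A cat sits on the mat."])` would download the checkpoint on first use. Taking the first token's vector (the `[CLS]` position) instead of the mean is another option people use.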
It seems like you are looking for the Sentence Transformers library, which trains Siamese BERT (etc.) networks on NLI data. That means you can indeed pass in a single sentence and get a sentence embedding back. They also have a few fine-tuned models that use cross-encoders instead, which score a pair of sentences jointly rather than embedding each one separately. Those are slower but lead to better performance on downstream tasks such as STS-B.
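A rough sketch of the single-sentence use case, assuming the `sentence-transformers` package is installed. The model name `bert-base-nli-mean-tokens` is an assumption on my side (one of the NLI-trained checkpoints, and any other model name from the library should work the same way). The helpers `embed_sentences` and `cosine_similarity` are hypothetical names for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed_sentences(sentences, model_name="bert-base-nli-mean-tokens"):
    """Encode each sentence independently into one fixed-size vector."""
    # Imported here so cosine_similarity works without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() takes a list of single sentences, no pairing needed.
    return model.encode(sentences)
```

With two sentences embedded this way, `cosine_similarity(vectors[0], vectors[1])` gives a quick semantic similarity score, even though each sentence was encoded on its own.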
Benchmark-wise, I have a new idea: SuperGLUE is one of the most difficult (multi-)task benchmarks for language understanding, and T5 is the current SOTA on it, so we can also try embedding vectors from T5.
Previously, these were not straightforward to extract (since T5 is an encoder-decoder), but the latest master version of Hugging Face Transformers now contains an encoder-only T5 model, from which we can directly extract the vectors of the pretrained model. (Thanks to @agemagician) … So this is an interesting choice IMO.
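In case it helps, a minimal sketch of extracting sentence vectors from the T5 encoder, assuming a `transformers` version that ships the `T5EncoderModel` class, plus `torch`. The checkpoint `t5-small`, the function name `encode_with_t5`, and the choice of masked mean pooling are my assumptions, not something prescribed by T5 itself.

```python
def encode_with_t5(sentences, model_name="t5-small"):
    """Return a (batch, hidden_size) tensor with one vector per sentence."""
    # Local imports keep this definition loadable without the packages.
    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5EncoderModel.from_pretrained(model_name)  # encoder stack only
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Shape: (batch, seq_len, hidden_size), one vector per token.
        hidden = model(**batch).last_hidden_state

    # Zero out padding positions, then average over the real tokens.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts
```

Larger checkpoints can be swapped in through `model_name`. Whether these vectors beat the Sentence Transformers ones for a given task is something to check empirically.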