Multilingual token, phrase and sentence representations for text similarity

Hello all

For some research of mine, I am looking for the best way to get sentence representations, as well as phrase and word representations, to use for text similarity. Specifically, I want to compare the representations of translated sentences, as well as those of their aligned individual words and word groups (phrases).

I could just take something like mT5 or XLM-R, use the final hidden states of the subword units, and pool them to create these representations. However, my fear is that such representations are not well suited to a text similarity task. This issue was also raised by the people over at SentenceTransformers in their paper, who propose to finetune LMs on STS and other tasks to get sentence representations that are actually meaningful in a text similarity context. I could try these models, but as far as I know they never run any token similarity tests, only sentence similarity.
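To make the pooling baseline concrete, here is a minimal sketch of what I have in mind. It is not tied to any particular model: `hidden_states` stands in for the final-layer subword vectors, and `word_ids` is an assumed mapping from each subword position to its word index, with `None` for special tokens (similar to what fast tokenizers in Hugging Face expose via `word_ids()`). The function names and the toy numbers are made up for illustration, and mean pooling is just one possible choice.

```python
import math


def mean_pool(hidden_states, positions):
    """Average the hidden-state vectors at the given subword positions."""
    if not positions:
        raise ValueError("positions must not be empty")
    dim = len(hidden_states[0])
    return [
        sum(hidden_states[p][d] for p in positions) / len(positions)
        for d in range(dim)
    ]


def pool_words(hidden_states, word_ids):
    """Return one mean-pooled vector per word, skipping special tokens (None)."""
    groups = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            groups.setdefault(wid, []).append(pos)
    return {wid: mean_pool(hidden_states, ps) for wid, ps in groups.items()}


def pool_span(hidden_states, word_ids, words):
    """Mean-pool all subwords belonging to a set of word indices.

    Passing a few word indices gives a phrase vector; passing all of them
    gives a sentence vector.
    """
    positions = [pos for pos, wid in enumerate(word_ids) if wid in words]
    return mean_pool(hidden_states, positions)


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data (hypothetical): 5 subword positions, 2 dimensions.
# Positions 1 and 2 belong to word 0, position 3 to word 1.
toy_states = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
toy_word_ids = [None, 0, 0, 1, None]

word_vectors = pool_words(toy_states, toy_word_ids)
sentence_vector = pool_span(toy_states, toy_word_ids, {0, 1})
similarity = cosine(word_vectors[0], word_vectors[1])
```

With aligned words or phrases from a translation pair, the same `cosine` call would be applied to the pooled vectors from each side. Whether those scores mean anything for similarity is exactly what I am unsure about.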

So if you have any ideas, perhaps some previous research that you read, or a new model that was actually evaluated on segment and token similarity, then I'd love to hear it!

Thanks in advance

Bram
