Multilingual token, phrase and sentence representations for text similarity

BramVanroy · January 13, 2021, 8:17am

Hello all

For some research of mine, I am looking for the best way to get sentence representations, as well as phrase and word representations, to use for text similarity. Specifically, I want to compare the representations of translated sentences, as well as those of their aligned individual words and word groups (phrases). I could just take something like mT5 or XLM-R and pool the final hidden states of the subword units to create these representations. However, my fear is that such representations are not well suited to a text similarity task. This issue was also raised by the people over at SentenceTransformers in their paper, who propose fine-tuning LMs on STS and other tasks to get sentence representations that are actually meaningful in a text similarity context. I could try these models, but as far as I know the authors only evaluate sentence similarity, never token similarity.
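
To make the pooling baseline concrete, here is a minimal sketch of what I have in mind. It is only an illustration under assumptions: the tiny 2-dimensional vectors stand in for the final hidden states of a model such as XLM-R, `word_ids` is assumed to be a subword-to-word mapping like the one Hugging Face fast tokenizers expose through `word_ids()` (with `None` for special tokens), and the helper names (`mean_pool`, `pool_words`, `pool_span`, `cosine`) are made up for this example.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def pool_words(hidden_states, word_ids):
    """Mean-pool subword vectors into one vector per word.

    word_ids[i] is the index of the word that subword i belongs to,
    or None for special tokens, which are skipped.
    """
    groups = {}
    for vec, wid in zip(hidden_states, word_ids):
        if wid is not None:
            groups.setdefault(wid, []).append(vec)
    return [mean_pool(groups[w]) for w in sorted(groups)]


def pool_span(word_vectors, start, end):
    """Mean-pool the word vectors in [start, end) into a phrase vector."""
    return mean_pool(word_vectors[start:end])


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy stand-ins for final hidden states (hypothetical values, not model output).
src_hidden = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
src_word_ids = [None, 0, 0, 1, None]  # word 0 is split into two subwords
tgt_hidden = [[0.0, 0.0], [4.0, 0.0], [0.0, 1.0], [0.0, 3.0], [0.0, 0.0]]
tgt_word_ids = [None, 0, 1, 1, None]  # word 1 is split into two subwords

src_words = pool_words(src_hidden, src_word_ids)
tgt_words = pool_words(tgt_hidden, tgt_word_ids)

# Sentence vector = span over all words; a phrase is any shorter span.
src_sentence = pool_span(src_words, 0, len(src_words))
tgt_sentence = pool_span(tgt_words, 0, len(tgt_words))

aligned_word_sim = cosine(src_words[0], tgt_words[0])
unaligned_word_sim = cosine(src_words[0], tgt_words[1])
sentence_sim = cosine(src_sentence, tgt_sentence)
```

Pooling words first and then averaging words into phrases and sentences is one choice among several; averaging directly over subwords would weight long, heavily split words more. Whether either variant yields representations that are meaningful for similarity is exactly what I am unsure about.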

So if you have any ideas, perhaps some previous research that you read, or a new model that was actually evaluated on segment and token similarity, then I’d love to hear it!

Thanks in advance

Bram
