Calculating Cosine Similarity with XLMRobertaModel Embeddings always leads to a ~0.99 score

Hello,

I wanted to calculate the cosine similarity between different groups of texts (in English and Chinese). So I computed the embeddings with XLMRobertaModel and then the cosine similarity between them. However, all cosine similarities were about 0.99 (with only tiny variations).

I found that odd, so I made the following mock example with three different texts: one from an LLM about soccer, one from a forum about random subjects (sorry for the vile content), and one with a random selection of symbols and letters with no meaning. However, the cosine similarity between these three very different texts was again around 0.99, with only slight variations. So something must be wrong with my code, but I don't know what :frowning: I also calculated the cosine similarity based on word2vec embeddings, and the results were very different. Could someone please help me? I currently have to write my Master's thesis and want to include computational text analysis, but I am a beginner with transformers and the like. THANK YOU!

PS: These are the cosine similarity scores with XLMRobertaModel for the mock example provided below:
Cosine similarity between text1 and text2: 0.9978247880935669
Cosine similarity between text1 and text3: 0.9880638122558594
Cosine similarity between text2 and text3: 0.9914124011993408

These are the cosine similarity scores with word2vec for the same texts (a rough sketch of that word2vec setup follows after the scores):
Word2Vec Cosine similarity between text1 and text2: 0.7184870839118958
Word2Vec Cosine similarity between text1 and text3: 0.0
Word2Vec Cosine similarity between text2 and text3: 0.0
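
In case it helps, this is roughly the kind of word2vec setup I used, only as a minimal sketch: the pre-trained vectors ('word2vec-google-news-300') and the naive whitespace tokenization here are placeholders, not necessarily exactly what I ran. Out-of-vocabulary tokens are skipped, so a text with no known words (like text3) ends up as a zero vector, which is consistent with the 0.0 scores above.

import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

# Sketch of a word2vec baseline: average the vectors of in-vocabulary tokens
wv = api.load('word2vec-google-news-300')  # placeholder pre-trained vectors

def w2v_embedding(text):
    tokens = text.lower().split()                 # naive whitespace tokenization
    vectors = [wv[t] for t in tokens if t in wv]  # keep only in-vocabulary tokens
    if not vectors:                               # all tokens unknown -> zero vector
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

# text1 and text2 are the mock texts from the example below
w2v_1 = w2v_embedding(text1).reshape(1, -1)
w2v_2 = w2v_embedding(text2).reshape(1, -1)
print(f'Word2Vec Cosine similarity between text1 and text2: {cosine_similarity(w2v_1, w2v_2)[0][0]}')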

**Mock Example with XLMRobertaModel:**

from transformers import XLMRobertaModel, XLMRobertaTokenizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import torch

model = XLMRobertaModel.from_pretrained('xlm-roberta-base', output_hidden_states=True)
tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

# Function to get a mean-pooled sentence embedding
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Token-level embeddings from the last hidden layer
    last_hidden_states = outputs.last_hidden_state
    # Attention-mask-weighted mean pooling so padding tokens are ignored
    input_mask_expanded = inputs['attention_mask'].unsqueeze(-1).expand(last_hidden_states.size()).float()
    sum_embeddings = torch.sum(last_hidden_states * input_mask_expanded, 1)
    sum_mask = input_mask_expanded.sum(1)
    sum_mask = torch.clamp(sum_mask, min=1e-9)  # avoid division by zero
    mean_embeddings = sum_embeddings / sum_mask
    return mean_embeddings[0].numpy()


# Text examples  
text1 = """Manchester United 0-1 Bayern Munich: Frustration Reigns at Old Trafford

Champions League Group Stage - Matchday 6

A tense atmosphere gripped Old Trafford on Wednesday night as Manchester United hosted Bayern Munich in a crucial Champions League group stage clash. United, needing a win to secure qualification, fell short in a frustrating 0-1 defeat.

Early Caged Affair

The opening exchanges were a cagey affair, with both teams prioritizing defensive solidity. Bayern, boasting a potent attack with Leroy Sane and Kingsley Coman, struggled to break down United's well-organized backline marshaled by captain Harry Maguire. United, on the other hand, found attacking opportunities scarce, with Bruno Fernandes and Marcus Rashford largely isolated upfront.

Goalkeeper Heroics

The first half saw a flurry of saves from both goalkeepers. United's new signing, André Onana, pulled off a magnificent reaction stop to deny Joshua Kimmich from close range. At the other end, Manuel Neuer showcased his experience with a diving save to keep out a well-struck free-kick from Fernandes.

Coman's Clinical Finish

The deadlock was finally broken ten minutes into the second half. A moment of brilliance from Harry Kane unlocked the Bayern defense. His perfectly weighted pass found Coman in behind the United backline, and the French winger coolly slotted the ball past Onana for the game's only goal.

United's Fading Hopes

United poured forward in search of an equalizer, but their attacking efforts lacked the necessary composure. Substitute Anthony Martial showed glimpses of his talent but couldn't find the breakthrough. The final whistle blew to a chorus of groans from the disappointed home crowd.

Disappointment for Ten Hag

This defeat marks a significant setback for Manchester United under new manager Erik ten Hag. The Red Devils crash out of the Champions League at the group stage and will now have to focus on securing a top-four finish in the Premier League.

Man of the Match: Kingsley Coman (Bayern Munich)

Looking Ahead

For Bayern Munich, this win secures their place in the knockout stages. Manchester United, meanwhile, will be left to rue several missed opportunities and a lack of cutting edge in attack. The pressure is now on ten Hag to turn things around and deliver results in the domestic league."""





text2 = """The entire point of nuking IP count was to let tourist shitposters like that dipshit to thrive and derail threads scott free.
Wasn't even a year ago that buzzwords like "coomer" would get nuked off /a/ in a timely manner but even that went away, and it's never going to happen again. The IP count removal was one last "fuck you" to the users of the site and now every board truly is quickly becoming /v/ 2.0. Literal 1000% increase in shitposting and derailing since last week and it's seemingly remaining that way.
I mean shit, we just had fucking s*yjak threads that lived for several hours on here the other day. Just bring on the xitter and discord screencap and eceleb threads already so we really can be /v/ 2.0 in all of its MOON glory.
during her fight with the executioner he expected to be able to outlast her and during the first test of becoming a tier 1 mage the other students were surprised over her manapool. the series have also said that it takes a ton of mana to defend and she's been able to defend a long time.
i'll not say you're wrong but to me it seems like she has a lot more mana than others.
Why did xyl-forum suddenly get so triggered by ESLs? It wasn't like this 2 years ago. Is this the latest psyop?
"""



text3 = """hjkasdhjkasd. 438723467234. Hel813z723---BLUM."""

# Compute embeddings
embedding1 = get_embedding(text1).reshape(1, -1)
embedding2 = get_embedding(text2).reshape(1, -1)
embedding3 = get_embedding(text3).reshape(1, -1)

# Calculate cosine similarity
similarity_score_1_2 = cosine_similarity(embedding1, embedding2)[0][0]
similarity_score_1_3 = cosine_similarity(embedding1, embedding3)[0][0]
similarity_score_2_3 = cosine_similarity(embedding2, embedding3)[0][0]

print(f'Cosine similarity between text1 and text2: {similarity_score_1_2}')
print(f'Cosine similarity between text1 and text3: {similarity_score_1_3}')
print(f'Cosine similarity between text2 and text3: {similarity_score_2_3}')