Way to fine-tune a pre-trained model & get the embeddings

I need the embedding of a sentence. Before using the BERT model, I want to fine-tune it for my specific domain and then get the embeddings (pooler_output from AutoModel).
This was my plan:

  1. Fine-tune a masked language model on the specific domain
  2. Load this fine-tuned model into AutoModel, and then get the embeddings

However, the embeddings I got from the second step don't seem to be correct.
(I did a simple check: for a particular sentence, I computed the cosine similarity between the pooler_output from standalone step 2 and the pooler_output from step 2 preceded by step 1. This similarity was low, which I think it shouldn't be, since I only fine-tuned on a small dataset (500 data points).)

Code -


#step1
from transformers import AutoModelForMaskedLM
bert_maskedML=AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

from transformers import AutoTokenizer
bert_tokenizer=AutoTokenizer.from_pretrained("bert-base-uncased")

import pandas as pd   # needed for pd.read_csv below
import datasets
from datasets import load_dataset
df=pd.read_csv('/kaggle/input/inputs-n500/Regression_inputs_n500.csv')
dataset=load_dataset("csv", data_files='.fintech_inputs_n500.csv', split=datasets.Split.TRAIN)

# tokenize the text column (truncation=True guards against texts longer than BERT's max length),
# then keep only the tensor columns the model needs
tok_oup=dataset.map(lambda x: bert_tokenizer(x['text'], padding='max_length', truncation=True), batched=True)
tok_oup=tok_oup.remove_columns('text')
tok_oup=tok_oup.remove_columns('Unnamed: 0')
tok_oup.set_format("torch", columns=["input_ids", "token_type_ids", "attention_mask"])


from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)
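To see what the collator actually feeds the model, you can run it on a couple of tokenized examples (a small optional check, not in the original code): it randomly masks about 15% of the tokens and sets the labels to -100 everywhere except the masked positions.

# optional sanity check: inspect one collated batch
batch = data_collator([tok_oup[i] for i in range(2)])
print(batch['input_ids'].shape)    # masked inputs
print(batch['labels'][0][:20])     # -100 for unmasked positions, original ids where tokens were masked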

from transformers import Trainer, TrainingArguments
training_args = TrainingArguments("test-trainer3")


trainer = Trainer(
    model=bert_maskedML,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tok_oup,
)


trainer.train()
trainer.save_model('./MaskedLM')
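Optionally (not in the original code), the tokenizer can be saved into the same directory, so the checkpoint can later be loaded with both AutoModel and AutoTokenizer:

# optional: keep the tokenizer next to the fine-tuned weights
bert_tokenizer.save_pretrained('./MaskedLM')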

#step 2: load the fine-tuned checkpoint as a plain BERT encoder
from transformers import AutoModel
bertMasked_auto=AutoModel.from_pretrained('/kaggle/working/MaskedLM')
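One thing worth checking here (an optional diagnostic, not part of the original post): from_pretrained can report which weights were missing from the saved MaskedLM checkpoint and therefore had to be newly initialized, which directly affects pooler_output.

# optional: inspect which weights could not be loaded from the MLM checkpoint
bertMasked_auto, loading_info = AutoModel.from_pretrained(
    '/kaggle/working/MaskedLM', output_loading_info=True
)
print(loading_info['missing_keys'])   # any newly initialized weights (e.g. the pooler) start from random values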


#step 2, if no step 1: the original pre-trained encoder, for comparison
from transformers import AutoModel
bert_auto=AutoModel.from_pretrained('bert-base-uncased')


#comparison
inputs=[ "no you say that if i make a late payment there is no late fee"]
inputs=bert_tokenizer(inputs, padding='max_length', return_tensors='pt')


bert_masked_predctn=bertMasked_auto(**inputs)
bert_auto_predctn=bert_auto(**inputs)
from torch.nn import CosineSimilarity
cos = CosineSimilarity(dim=0, eps=1e-6)
auto_pooler=bert_auto_predctn['pooler_output']
Masked_auto_pooler=bert_masked_predctn['pooler_output']
# cosine similarity between the two models' pooler_output for the same sentence
cos(auto_pooler[0], Masked_auto_pooler[0])

I would also like to fine-tune a model for a specific domain in an unsupervised manner, not from scratch. I understood that I need to fine-tune a masked language model for the specific domain. And I did it, using the IMDB example from here: Fine-tuning a masked language model - Hugging Face NLP Course.
I also repeated the code given above in this topic, with a different dataset for my specific domain.
I got a cosine similarity close to zero in both cases.

  1. I compared the pooler_output vectors of the fine-tuned model and the base model (bert-base-uncased).
  2. I took the pooler_output for the same sentence, and the cosine similarity is about 0, which would mean the sentence is not similar to itself.

I do not understand why the pooler_outputs of the 2 models become so different even after fine-tuning on a small dataset.

Maybe it is not a good idea to compare the pooler_output of 2 different models?

I also read here (python - How to compare sentence similarities using embeddings from BERT - Stack Overflow) that
"BERT is not pretrained for semantic similarity, which will result in poor results, even worse than simple Glove Embeddings."

The answer is the following: it is not correct to compare the embeddings of two different models.

Let's take two similar sentences:

inputs=["This is a good film"]
inputs1=["This is a great movie"]

inputs=bert_tokenizer(inputs, padding='max_length', return_tensors='pt')
inputs1=bert_tokenizer(inputs1, padding='max_length', return_tensors='pt')

import numpy as np
bert_masked_predctn=bertMasked_auto(**inputs)
bert_auto_predctn=bert_auto(**inputs)

bert_masked_predctn1=bertMasked_auto(**inputs1)
bert_auto_predctn1=bert_auto(**inputs1)

The embeddings:

auto_pooler=bert_auto_predctn['pooler_output'].detach().numpy()
Masked_auto_pooler=bert_masked_predctn['pooler_output'].detach().numpy()

auto_pooler1=bert_auto_predctn1['pooler_output'].detach().numpy()
Masked_auto_pooler1=bert_masked_predctn1['pooler_output'].detach().numpy()

from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(auto_pooler, auto_pooler1))
print(cosine_similarity(Masked_auto_pooler, Masked_auto_pooler1))

Result will be:
auto_pooler: 0.98
masked_pooler: 1.0
###############################

In case I take two dissimilar sentences like
inputs=["This is a good film"]
inputs1=["Cats like eating fish"]
the cosine similarity of the pooler_output of the masked (fine-tuned) model is 0.98.

For the bert-base-uncased model it is 0.68, which is still a very high value, but more realistic.

So pooler_output gives very bad results when I need to use embeddings to compare the similarity of 2 small texts.
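A common alternative to pooler_output for this kind of comparison (a sketch under my own assumptions, not from the posts above; mean_pool is a hypothetical helper) is to mean-pool last_hidden_state over the non-padding tokens:

import torch

def mean_pool(model_output, attention_mask):
    # average the token embeddings, ignoring padding positions
    token_embeddings = model_output['last_hidden_state']            # (batch, seq_len, hidden)
    mask = attention_mask.unsqueeze(-1).type_as(token_embeddings)   # (batch, seq_len, 1)
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

with torch.no_grad():
    emb = mean_pool(bert_auto(**inputs), inputs['attention_mask'])
    emb1 = mean_pool(bert_auto(**inputs1), inputs1['attention_mask'])
print(torch.nn.functional.cosine_similarity(emb, emb1))

This tends to behave better than pooler_output for similarity, though still not as well as a model trained specifically for sentence embeddings.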

Sentence Transformers gives much better results:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L12-v2')
"This is a great film" / "This is a great movie": cosine similarity score 0.9463
"This is a great movie" / "Cats like fish": cosine similarity score 0.0790
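For reference, a minimal sketch of how scores like these can be computed (assuming the same model name as above; the numbers quoted are from the post, not re-run here):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L12-v2')
emb = model.encode(
    ["This is a great film", "This is a great movie", "Cats like fish"],
    convert_to_tensor=True,
)
print(util.cos_sim(emb[0], emb[1]))   # similar pair
print(util.cos_sim(emb[1], emb[2]))   # dissimilar pair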

These results are very unexpected for me.