Transformer Domain Adaptation

Hi there,
I am trying to improve the performance of transformer models following ideas such as BERT -> LawBERT.

I’ve been following the key steps: (a) load the BERT model/tokenizer, (b) select data from a domain-specific corpus and augment the vocabulary with new tokens specific to the field, (c) run domain pre-training using the new tokenizer that covers my vocabulary.
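Roughly, steps (a) and (b) look like this in my code (the two example tokens here are just placeholders for my actual domain vocabulary):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# a. Load the base BERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# b. Augment the vocabulary with domain-specific tokens (placeholders here)
new_tokens = ["phosphorylation", "mutagenesis"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize so the embedding matrix gets one (randomly initialised) row per new token
model.resize_token_embeddings(len(tokenizer))
```

Step (c) is then the usual MLM pre-training with `Trainer` on the in-domain corpus.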

However, when I look at the embedding similarity of my new tokens, I find it is worse than the embedding similarity of the standard BERT model. So my question is: what am I doing wrong?

To be very specific, I tried to follow this tutorial:

It extends the ideas from Domain Adaptation Components — Transformers Domain Adaptation 0.3.0 documentation.

After the trainer.train() step I expected the new tokens to have a much closer embedding similarity, but that is not the case. Can anyone help me understand why?
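One thing I noticed while debugging, in case it matters: the rows added by `resize_token_embeddings` start out randomly initialised, and in 768 dimensions two random vectors are nearly orthogonal, so their cosine distance sits close to 1 until training moves them. A minimal numpy sketch of that effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two freshly initialised 768-d embedding rows, BERT-style init (std 0.02)
a, b = rng.normal(scale=0.02, size=(2, 768))

cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# |cos_sim| is tiny, i.e. cosine distance (1 - cos_sim) stays close to 1
```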

Many thanks,
PS: what I mean by embedding similarity can be seen with this function:
```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics.pairwise import cosine_distances

best_checkpoint = './results/domain_pre_training/checkpoint-150'


def word_proximity_comparison(model=da_model, tokenizer=da_tokenizer,
                              wordlist=['biology', 'genomic', 'transcriptional',
                                        'tyrosine', 'phosphorylation', 'mutagenesis',
                                        'homology', 'chocolat', 'car', 'water', 'bike']):
    """Compute the cosine distance matrix of word embeddings
    to visualise the quality of our embeddings."""
    out = tokenizer(wordlist, max_length=3, padding="max_length",
                    truncation=True, return_tensors='pt')

    # Input embedding matrix (the first parameter of the model)
    embeddings = model.get_input_embeddings().weight

    # For each word, take the token at position 1 (right after [CLS])
    vectors = embeddings[out['input_ids'][:, 1]].detach().cpu().numpy()

    plt.subplots(figsize=(8, 5))
    cosine_distance_matrix = cosine_distances(X=vectors)
    sns.heatmap(pd.DataFrame(cosine_distance_matrix, columns=wordlist, index=wordlist),
                cmap='Blues', annot=cosine_distance_matrix, vmax=1, vmin=0)
    plt.title('Heatmap of pretrained embedding similarities\n'
              'Lighter means closer in the embedding space')
```
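As a sanity check that the distance computation itself behaves as expected (identical directions give 0, orthogonal directions give 1):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Toy vectors: rows 0 and 1 point the same way, row 2 is orthogonal to them
vecs = np.array([[1.0, 0.0],
                 [2.0, 0.0],
                 [0.0, 1.0]])

d = cosine_distances(X=vecs)
```

Here `d[0, 1]` is 0.0 and `d[0, 2]` is 1.0, so in the heatmap lighter cells really do mean closer embeddings.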