Identifying and getting the right embeddings from a BERT model fine-tuned on domain-specific data

Problem statement: We trained a BERT model from scratch on domain-specific data and used that model's embeddings to find the similarity between paragraphs and sentences. In this setup the similarity came out high for dissimilar sentence pairs as well as for similar ones.

After that we tried task-specific fine-tuning of the BERT model on a domain-specific labelled dataset, using binary classification as the task. Even after fine-tuning, we find the cosine similarity between dissimilar sentences to be very high.
We trained for 4 epochs with the BERT layers frozen and then unfroze them for 1 more epoch (a sketch of this schedule is shown after the code below).

Following is the code I am using for fine-tuning and for finding the similarity between sentences.
I am looking for some help: am I fine-tuning the right way? Any suggestions on why the model is not learning the patterns/information even after fine-tuning?

Following are details from model training:

# BERT layers frozen (epochs 1-4):
# loss      accuracy  val_loss  val_accuracy
  0.059805  0.647953  0.023045  0.928364
  0.042791  0.783043  0.015368  0.969022
  0.036922  0.822253  0.013272  0.969022
  0.035886  0.827223  0.012590  0.960310
# BERT layers unfrozen (epoch 5):
  0.038641  0.816376  0.011593  0.927396

Example:

sentence1 =  'required java developer with 5 to 8 years of experience'
sentence3 =  'Blue fox fell out of the sky'

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from transformers import BertTokenizer, TFBertModel

# tokenizer assumed to come from the same checkpoint path as the model
tokenizer = BertTokenizer.from_pretrained(path)
BERT = TFBertModel.from_pretrained(path, from_pt=True)

def get_bert_embedding(sentence):
    # Tokenise a single sentence and build the model inputs
    encoded_op = tokenizer.encode_plus(sentence)
    input_ids = np.array(encoded_op['input_ids'])
    attention_mask = np.array(encoded_op['attention_mask'])
    # Add the batch dimension: shape (1, seq_len), not (seq_len, 1)
    input_ids = np.reshape(input_ids, (1, len(input_ids)))
    attention_mask = np.reshape(attention_mask, (1, len(attention_mask)))
    x = BERT(input_ids, attention_mask)
    # x[0] is the last hidden state (1, seq_len, hidden); return the [CLS] vector
    return x[0][:, 0, :][-1]
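
The similarity itself is then computed with plain cosine similarity between the two embedding vectors, roughly as below (a minimal sketch; `cosine_sim` is just an illustrative helper, not the exact code from our pipeline):

def cosine_sim(a, b):
    # Cosine similarity between two 1-D embedding vectors
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb1 = get_bert_embedding(sentence1)
emb3 = get_bert_embedding(sentence3)
print(cosine_sim(emb1, emb3))  # comes out very high even for unrelated sentences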


def build_model(transformer, max_length=params['MAX_LENGTH']):
    weight_initializer = tf.keras.initializers.GlorotNormal(seed=params['RANDOM_STATE']) 
    input_ids_layer = tf.keras.layers.Input(shape=(max_length,), 
                                            name='input_ids', 
                                            dtype='int32')
    input_attention_layer = tf.keras.layers.Input(shape=(max_length,), 
                                                  name='input_attention', 
                                                  dtype='int32')
    # Take the [CLS] token of the last hidden state as the sequence representation
    last_hidden_state = transformer([input_ids_layer, input_attention_layer])[0]
    cls_token = last_hidden_state[:, 0, :]
    D1 = tf.keras.layers.Dropout(params['LAYER_DROPOUT'],
                                 seed=params['RANDOM_STATE']
                                )(cls_token)
    X = tf.keras.layers.Dense(256,
                              activation='relu',
                              kernel_initializer=weight_initializer,
                              bias_initializer='zeros'
                              )(D1)
    D2 = tf.keras.layers.Dropout(params['LAYER_DROPOUT'],
                                 seed=params['RANDOM_STATE']
                                )(X)
    X = tf.keras.layers.Dense(32,
                              activation='relu',
                              kernel_initializer=weight_initializer,
                              bias_initializer='zeros'
                              )(D2)
    D3 = tf.keras.layers.Dropout(params['LAYER_DROPOUT'],
                                 seed=params['RANDOM_STATE']
                                )(X)
    output = tf.keras.layers.Dense(1, 
                                   activation='sigmoid',
                                   kernel_initializer=weight_initializer, 
                                   bias_initializer='zeros'
                                   )(D3)
    model = tf.keras.Model([input_ids_layer, input_attention_layer], output)
    model.compile(tf.keras.optimizers.Adam(learning_rate=params['LEARNING_RATE']),
                  loss=focal_loss(),
                  metrics=['accuracy'])
    return model,transformer


def focal_loss(gamma=params['FL_GAMMA'], alpha=params['FL_ALPHA']):
    def focal_loss_fixed(y_true, y_pred):
        # Binary focal loss: down-weight easy examples via the (1 - p_t)^gamma factor
        pt_1 = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred))
        pt_0 = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred))
        eps = K.epsilon()  # guard against log(0)
        return (-K.mean(alpha * K.pow(1. - pt_1, gamma) * K.log(pt_1 + eps))
                - K.mean((1 - alpha) * K.pow(pt_0, gamma) * K.log(1. - pt_0 + eps)))
    return focal_loss_fixed
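
For reference, this is roughly how the freeze/unfreeze schedule described above is run (a sketch only; `train_ids`, `train_attention`, `train_labels`, the validation arrays and `params['BATCH_SIZE']` are placeholders for our actual data pipeline):

# Phase 1: BERT layers frozen, train only the classification head for 4 epochs
bert_base = TFBertModel.from_pretrained(path, from_pt=True)
bert_base.trainable = False
model, transformer = build_model(bert_base)
model.fit([train_ids, train_attention], train_labels,
          validation_data=([val_ids, val_attention], val_labels),
          epochs=4, batch_size=params['BATCH_SIZE'])

# Phase 2: unfreeze BERT, re-compile so the change takes effect, fine-tune 1 more epoch
transformer.trainable = True
model.compile(tf.keras.optimizers.Adam(learning_rate=params['LEARNING_RATE']),
              loss=focal_loss(),
              metrics=['accuracy'])
model.fit([train_ids, train_attention], train_labels,
          validation_data=([val_ids, val_attention], val_labels),
          epochs=1, batch_size=params['BATCH_SIZE'])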