BERT pre-trained model is overfitting

I have fine-tuned a pre-trained BERT model from the Hugging Face library on the Jigsaw Toxic Comment Classification dataset to detect hateful comments. However, when I run inference on positive sentences, it gives me wrong results.

My accuracy comes out between 97% and 98% every time.
I have trained with the following settings (a rough sketch of the setup follows):
Epochs: 4, 5, 6
Learning rate: 2e-5, 3e-5, 5e-5
Pre-trained models used: bert-base-uncased, RoBERTa, XLM-R (for the combined German and English dataset)
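For context, the setup is essentially the standard multi-label sequence classification recipe. This is only a simplified sketch (my actual code wraps the model in a custom training loop, and the model/optimizer names here are placeholders):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
MODEL_NAME = "bert-base-uncased"  # also tried roberta-base and xlm-roberta-base

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABEL_COLUMNS),
    problem_type="multi_label_classification",  # BCE loss, one sigmoid score per label
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # LRs tried: 2e-5, 3e-5, 5e-5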
Previously my classification report metrics were very low because the dataset is highly imbalanced. To overcome this I augmented the data by back-translation (ENG → DE → ENG, ENG → FR → ENG, ENG → IT → ENG); a sketch of that step is below. After augmenting, I get better precision, recall and F1 scores; however, when I run inference on a few positive comments, the model gives the wrong scores.
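The back-translation step looked roughly like this (a simplified sketch using the Helsinki-NLP opus-mt models via the transformers pipeline; the exact translation models and settings I used may differ):

from transformers import pipeline

# back-translation ENG -> DE -> ENG (the FR and IT round trips work the same way)
en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(text: str) -> str:
    german = en_de(text, max_length=512)[0]["translation_text"]
    return de_en(german, max_length=512)[0]["translation_text"]

augmented_comment = back_translate("This comment is perfectly fine.")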
For example, given the input "You are a nice person", the model classifies it as toxic or insult. I am not sure how to resolve this issue; any suggestion would be really helpful.

test_comment = "You are a nice person"

# tokenize the single comment exactly as during training
encoding = tokenizer.encode_plus(
    test_comment,
    add_special_tokens=True,
    max_length=512,
    return_token_type_ids=False,
    padding="max_length",
    return_attention_mask=True,
    return_tensors="pt",
)

# run the trained model without gradient tracking so the output can be converted to NumPy
with torch.no_grad():
    _, test_prediction = trained_model(encoding["input_ids"], encoding["attention_mask"])
test_prediction = test_prediction.flatten().numpy()

for label, prediction in zip(LABEL_COLUMNS, test_prediction):
    print(f"{label}: {prediction}")

Result:
toxic: 0.289602130651474
severe_toxic: 0.012312621809542179
obscene: 0.26335516571998596
threat: 0.0017053773626685143
insult: 0.54698246717453
identity_hate: 0.0013856851728633046
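For reference, binarizing these per-label scores with a fixed cut-off (0.5 here is an assumption; whatever threshold was used for the reported metrics should go in its place) flags this sentence as insult:

THRESHOLD = 0.5  # assumed cut-off
flagged = [label for label, score in zip(LABEL_COLUMNS, test_prediction) if score > THRESHOLD]
print(flagged)  # ['insult'] for the scores above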