Fine-tuning multilingual BERT for sequence classification with Trainer API


I'm fine-tuning multilingual BERT for sequence classification, with inputs formatted like this: [CLS] context [SEP] choice [SEP] [PAD] …

I'm using the Trainer API:

batch_size = 8  # try with 32
num_train_epochs = 6
logging_steps = len(encoded_datasets['train']) // (2 * batch_size * num_train_epochs)

training_args = TrainingArguments(
    learning_rate=0.01, # {5e-5, 3e-5, 2e-5, 0.1}
    weight_decay=0.1, # {0, 0.01, 0.1}

def compute_metrics(eval_pred):
    metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    logitsTensors = torch.from_numpy(logits)
    print('eval_pred:', eval_pred)
    print('logits:', logits)
    print('labels:', labels)
    probabilities = torch.softmax(logitsTensors, dim=1)
    predictions = torch.argmax(probabilities, dim=1) # [1, 0, 0, 1...] axis=-1
    print('predictions: ', predictions)
    # return {"accuracy": np.mean(predictions == labels)}
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    train_dataset=encoded_datasets['train'], #encoded_datasets['train'],
    eval_dataset=encoded_datasets['validation'], #encoded_datasets['validation'],
    # id2label=id2label,
    # label2id=label2id,

I have experimented with different parameters, but I almost always get an accuracy of about 0.5, and my training and validation losses stay almost the same.

What could this mean?

Could you recommend some guidelines?

Thank you!

Can you share your training script?

Hi, thank you for your answer.

I don't have a training script. I was hoping to train the model through Trainer as shown, without writing a PyTorch training loop. Is this approach correct?

The accuracy is always stuck at 50%. What could that mean?

Can you try this as your compute_metrics and see what accuracy you get?

    import numpy as np
    import torch
    from sklearn.metrics import accuracy_score

    def compute_metrics(p):
        logits, labels = p
        pred = np.argmax(logits, axis=1)  # predicted class per example
        # Class probabilities, useful for inspecting the model's confidence
        prob = torch.softmax(torch.tensor(logits), dim=-1)
        accuracy = accuracy_score(y_true=labels, y_pred=pred)
        return {"accuracy": accuracy}

Hi, thank you for your help.

I ran the experiment for 6 epochs, these are the results:

{'eval_loss': 0.6945369243621826, 'eval_accuracy': 0.4765625, 'eval_runtime': 1.1837, 'eval_samples_per_second': 108.136, 'eval_steps_per_second': 13.517, 'epoch': 1.0}

{'eval_loss': 0.6935781836509705, 'eval_accuracy': 0.5078125, 'eval_runtime': 1.1774, 'eval_samples_per_second': 108.718, 'eval_steps_per_second': 13.59, 'epoch': 2.0}

{'eval_loss': 0.6960821747779846, 'eval_accuracy': 0.5, 'eval_runtime': 1.1887, 'eval_samples_per_second': 107.677, 'eval_steps_per_second': 13.46, 'epoch': 3.0}

{'eval_loss': 0.6931933760643005, 'eval_accuracy': 0.5, 'eval_runtime': 1.1842, 'eval_samples_per_second': 108.091, 'eval_steps_per_second': 13.511, 'epoch': 4.0}

{'eval_loss': 0.6936441659927368, 'eval_accuracy': 0.484375, 'eval_runtime': 1.183, 'eval_samples_per_second': 108.197, 'eval_steps_per_second': 13.525, 'epoch': 5.0}

{'eval_loss': 0.6934036016464233, 'eval_accuracy': 0.484375, 'eval_runtime': 1.1699, 'eval_samples_per_second': 109.411, 'eval_steps_per_second': 13.676, 'epoch': 6.0}

{'train_runtime': 217.6431, 'train_samples_per_second': 31.758, 'train_steps_per_second': 3.97, 'train_loss': 0.7024143382355019, 'epoch': 6.0}

There is no change in the evaluation loss, and the accuracy is stuck. I also noticed that all the logits are negative, like this (why is that?):

logits_tensor: tensor([[-0.0935, -0.1828],
[-0.1098, -0.1829],
[-0.0728, -0.1879],
[-0.0940, -0.1868],
[-0.2301, -0.2504],
[-0.2266, -0.2678],
[-0.2459, -0.2656],
[-0.2376, -0.2704],
[-0.1885, -0.0431],
[-0.2165, -0.0214],

I think my validation partition is wrongly formulated. I believe I have to transform it back to the original format and map the highest logit to the corresponding label, which would be the prediction.
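One way to read these numbers: an evaluation loss of about 0.693 is ln 2, which is what cross-entropy gives when a two-class model assigns roughly 0.5 to each class. The sign of the logits is not a problem in itself, since softmax depends only on the differences between them. A minimal sketch in plain Python, using the first logit row quoted above (the helper names are illustrative and not part of the Trainer API):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

row = [-0.0935, -0.1828]                   # first row of the printed logits
probs = softmax(row)                       # both probabilities are close to 0.5
shifted = softmax([x + 5.0 for x in row])  # same differences, positive logits

chance_loss = math.log(2)                  # loss of a 50/50 guess
loss_if_0 = cross_entropy(row, 0)          # loss if the true label is 0
loss_if_1 = cross_entropy(row, 1)          # loss if the true label is 1

print(probs, shifted)
print(chance_loss, loss_if_0, loss_if_1)
```

If the per-example losses hover around ln 2 on both sides like this, the model is effectively guessing, whatever the sign of its logits.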

Maybe your data quality is bad. I was also stuck at 40% accuracy, so I improved my data and now I'm at 60%. How big is your training data? You will have to do an error analysis: build a confusion matrix and start with the classes that are most often confused with each other.
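As a rough illustration of that error analysis, here is a small sketch that builds a confusion matrix by hand and lists the most frequent confusions first. The labels and predictions are toy values made up for the example, and `confusion_counts` is a hypothetical helper; scikit-learn's `confusion_matrix` would give the same table.

```python
from collections import Counter

def confusion_counts(labels, preds):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(labels, preds))

# Toy values for illustration only; not taken from the thread.
labels = [0, 0, 1, 1, 1, 0]
preds = [0, 1, 1, 1, 0, 1]

counts = confusion_counts(labels, preds)
classes = sorted(set(labels) | set(preds))

# Rows are true classes, columns are predicted classes.
matrix = [[counts[(t, p)] for p in classes] for t in classes]

# Off-diagonal cells, largest first: the most frequent confusions.
confusions = sorted(
    ((n, t, p) for (t, p), n in counts.items() if t != p),
    reverse=True,
)

for row in matrix:
    print(row)
print(confusions)
```

Each entry of `confusions` is (count, true class, predicted class), so the first entry points at the pair of classes worth inspecting first.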

Hi there,

I'm fine-tuning a Spanish version of BERT with 1152 instances. I have now switched the training to native PyTorch. I will look closely into what you are suggesting. I also believe I'm doing the evaluation wrong.

I am evaluating sentences for two options/candidates as a binary classification task, where my classes are 0 and 1.

Each sentence is split into two inputs, one per option. I think I need to map the model's predictions back onto the original sentences and evaluate on that validation set.
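The evaluation described here could look roughly like the sketch below: score each (context, choice) pair, group the pairs that belong to the same original sentence, and pick the choice with the highest score. It assumes that consecutive rows belong to the same sentence and that the gold answer is stored as the index of the correct choice; the toy logits, `gold`, and `pick_choices` are all made up for illustration.

```python
def pick_choices(positive_scores, choices_per_question=2):
    """For each question, return the index of the highest-scoring choice."""
    picks = []
    for start in range(0, len(positive_scores), choices_per_question):
        group = positive_scores[start:start + choices_per_question]
        picks.append(max(range(len(group)), key=group.__getitem__))
    return picks

# Toy logits: one row per (context, choice) pair, columns are classes 0 and 1.
# Assumption: rows 2k and 2k+1 are the two choices of question k.
logits = [
    [0.2, -0.1], [-0.3, 0.4],   # question 0: choice 1 scores higher
    [-0.5, 0.9], [0.1, 0.0],    # question 1: choice 0 scores higher
    [0.0, 0.3], [0.0, 0.1],     # question 2: choice 0 scores higher
]
gold = [1, 0, 1]                # hypothetical index of the correct choice

# Score each choice by its class-1 logit minus its class-0 logit.
scores = [row[1] - row[0] for row in logits]
picks = pick_choices(scores)
accuracy = sum(p == g for p, g in zip(picks, gold)) / len(gold)

print(picks, accuracy)
```

Scoring per question this way always selects exactly one choice, which the row-level accuracy over independent 0/1 predictions does not guarantee.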

I do appreciate your advice :grin:
