Fine-tuning BART for code translation

Hello, I'm new to Hugging Face, and to learn the ropes I'm trying to fine-tune BART (base) to translate between two made-up programming languages, but I'm running into an issue where the model doesn't return a loss during training.

I manually put together a dataset in JSON format with only a few samples:

[
  {
    "source_code": "create_schema: 1,2,3",
    "target_code": "schema: {col1, col2, col3}"
  },
  {
    "source_code": "create_schema: 4,5,6",
    "target_code": "schema {col4, col5, col6}"
  },
  {
    "source_code": "create_schema: 7,8,9",
    "target_code": "schema: {col7, col8, col9}"
  }
]
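For reference, this is roughly how I picture the file mapping onto a datasets Dataset (just a sanity-check sketch on my side; I'm assuming Dataset.from_list is an acceptable way to do this, and I'm not sure whether the field="translation" argument I use later is actually needed for a flat list like this):

import json
from datasets import Dataset

# Sanity check: read the raw JSON list and turn it into a Dataset with
# "source_code" and "target_code" columns.
with open("pharos_nitro_pairs.json") as f:
    pairs = json.load(f)

ds = Dataset.from_list(pairs)
print(ds)                    # expect: features ['source_code', 'target_code'], num_rows: 3
print(ds[0]["source_code"])  # create_schema: 1,2,3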

My training code is below:

from datasets import load_dataset
from transformers import BartForConditionalGeneration, BartTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# Tokenize the source/target code pairs
def tokenize_function(example):
    return tokenizer(example["source_code"], example["target_code"], padding="max_length", truncation=True)

# Load dataset and split it into training and validation sets

your_dataset = load_dataset("json", data_files={"train": "pharos_nitro_pairs.json"}, field="translation")

train_dataset, val_dataset = your_dataset["train"], your_dataset["validation"]
train_tokenized_dataset = train_dataset.map(tokenize_function, batched=True)
val_tokenized_dataset = val_dataset.map(tokenize_function, batched=True)

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

training_args = Seq2SeqTrainingArguments(
    output_dir="./results2",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=2,
    learning_rate=5e-2,
    weight_decay=0.01,
    # warmup_steps=1,
    save_total_limit=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=val_tokenized_dataset,
    tokenizer=tokenizer,
)

trainer.train()

This is the error that I'm getting; it seems like the model is unable to calculate the loss. Any help would be greatly appreciated.

ValueError: The model did not return a loss from the inputs, only the following keys:
logits,past_key_values,encoder_last_hidden_state. For reference, the inputs it received are
input_ids,attention_mask.
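From reading the error, my guess is that the tokenized batches need a labels field for the model to compute a loss, and my current tokenize_function only produces input_ids and attention_mask. Would something along these lines be the right direction? (Just a rough sketch on my part; I'm assuming text_target= is the correct way to tokenize the target sequences.)

# Rough sketch: tokenize the source as model inputs and the target as labels,
# so the Trainer has something to compute a seq2seq loss against.
def tokenize_function(example):
    model_inputs = tokenizer(
        example["source_code"], padding="max_length", truncation=True
    )
    labels = tokenizer(
        text_target=example["target_code"], padding="max_length", truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs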