Fine-tuning NLLB model

Hi,

I am trying to fine-tune the NLLB model for Hindi-Dogri (Indian languages) translation. I followed the translation pipeline given on the page Translation - Hugging Face Course. My dataset looks like this:

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 6356
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 1589
    })
})

An individual item looks like:

{'id': 5583,
 'translation': {'do': 'जिनेंगी अस मनुक्खें दे ज्ञान थमां सखाई दियें गल्लें च नेई लेकन पवित्र आत्मा थमां सखाई दियें गल्लें च पवित्र आत्मा आत्मिक ज्ञान कन्ने आत्मिक गल्लें गी खोलीखोली दसदा ऐ',
  'hi': 'जिनको हम मनुष्यों के ज्ञान की सिखाई हुई बातों में नहीं परन्तु पवित्र आत्मा की सिखाई हुई बातों में आत्मा आत्मिक ज्ञान से आत्मिक बातों की व्याख्या करती है'}}

My script is:

from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    NllbTokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = NllbTokenizerFast.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="hin_Deva", tgt_lang="dog_Deva"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
training_args = Seq2SeqTrainingArguments(
    output_dir="hi_do_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
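The script above references `tokenized_data`, but the preprocessing step that produces it is not shown. A hedged sketch of what that step could look like, assuming the `preprocess` function name, the `max_length` value, and the `raw_data` variable are placeholders rather than anything from the original script:

```python
# Hypothetical preprocessing that would produce `tokenized_data`.
# The function name, max_length, and `raw_data` are assumptions.
def preprocess(examples, tokenizer, max_length=128):
    # Each row's "translation" dict holds the Hindi source ("hi")
    # and the Dogri target ("do") sentences.
    inputs = [ex["hi"] for ex in examples["translation"]]
    targets = [ex["do"] for ex in examples["translation"]]
    # text_target= makes the tokenizer also produce decoder-side "labels".
    return tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )

# Applied to the raw DatasetDict, dropping the original columns so that
# only model inputs (input_ids, attention_mask, labels) remain:
# tokenized_data = raw_data.map(
#     lambda batch: preprocess(batch, tokenizer),
#     batched=True,
#     remove_columns=raw_data["train"].column_names,
# )
```

If you pass `remove_columns` as sketched above, the `id` and `translation` columns are dropped before training and the warning you quote below never appears.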

When calling trainer.train() I get:

The following columns in the training set don't have a corresponding argument in M2M100ForConditionalGeneration.forward and have been ignored: translation, id. If translation, id are not expected by M2M100ForConditionalGeneration.forward, you can safely ignore this message.

What does this message mean? Is my data format wrong? What data format does NLLB require?

From looking at the code, for some functions only the parameters mentioned in the function signature are kept and passed through, e.g. code like:

inspect.signature(self.prepare_inputs_for_generation).parameters.keys()

So the parameters that get filtered out are sometimes logged, as in the message above. I believe it can be safely ignored.
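As a toy illustration of that filtering (the `forward` signature here is a simplified stand-in, not the real M2M100ForConditionalGeneration.forward):

```python
import inspect

# Simplified stand-in for a model's forward method; the real one accepts
# input_ids, attention_mask, labels, and several other arguments.
def forward(input_ids=None, attention_mask=None, labels=None):
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# The Trainer-style filtering: keep only keys matching the signature.
accepted = set(inspect.signature(forward).parameters.keys())
batch = {
    "input_ids": [0, 1],
    "attention_mask": [1, 1],
    "labels": [2, 3],
    "id": 5583,
    "translation": {"hi": "...", "do": "..."},
}
ignored = sorted(k for k in batch if k not in accepted)
filtered = {k: v for k, v in batch.items() if k in accepted}
print(ignored)  # ['id', 'translation'] — the columns named in the warning
```

The extra columns never reach the model; they are dropped and logged once, which is exactly the message you saw.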