Hi,
I am trying to fine-tune the NLLB model for Hindi-Dogri (Indian languages) translation, following the translation pipeline described on the Translation - Hugging Face Course page. My dataset looks like this:
DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 6356
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 1589
    })
})
An individual item looks like:
{'id': 5583,
 'translation': {'do': 'जिनेंगी अस मनुक्खें दे ज्ञान थमां सखाई दियें गल्लें च नेई लेकन पवित्र आत्मा थमां सखाई दियें गल्लें च पवित्र आत्मा आत्मिक ज्ञान कन्ने आत्मिक गल्लें गी खोलीखोली दसदा ऐ',
  'hi': 'जिनको हम मनुष्यों के ज्ञान की सिखाई हुई बातों में नहीं परन्तु पवित्र आत्मा की सिखाई हुई बातों में आत्मा आत्मिक ज्ञान से आत्मिक बातों की व्याख्या करती है'}}
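For completeness, the preprocessing that produces the `tokenized_data` used in the script below is not shown; it roughly follows the course's pattern of tokenizing the Hindi source and the Dogri target together via `text_target`. A minimal sketch (the function names and `max_length` value here are my own choices, not from the course):

```python
def make_preprocess_function(tokenizer, max_length=128):
    """Return a function usable with datasets.map(..., batched=True).

    It pulls the Hindi text out of each `translation` dict as the model
    input and the Dogri text as the target, and tokenizes both in one
    call so the tokenizer fills in `labels` from `text_target`.
    """
    def preprocess(examples):
        inputs = [ex["hi"] for ex in examples["translation"]]
        targets = [ex["do"] for ex in examples["translation"]]
        return tokenizer(
            inputs,
            text_target=targets,
            max_length=max_length,
            truncation=True,
        )
    return preprocess
```

With the NLLB tokenizer this would be applied as `tokenized_data = raw_data.map(make_preprocess_function(tokenizer), batched=True)`.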
My script is:
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    NllbTokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = NllbTokenizerFast.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="hin_Deva", tgt_lang="dog_Deva"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="hi_do_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
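The `compute_metrics` referenced above is also not shown; mine roughly follows the course's sacreBLEU recipe. A sketch, assuming `metric` is `evaluate.load("sacrebleu")` and `tokenizer` is the NLLB tokenizer (the factory wrapper is my own structuring):

```python
import numpy as np

def make_compute_metrics(tokenizer, metric):
    """Build the compute_metrics callable passed to Seq2SeqTrainer.

    `metric` is assumed to be evaluate.load("sacrebleu"); `tokenizer`
    only needs batch_decode and pad_token_id.
    """
    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        # The collator pads labels with -100; swap that back to the pad
        # token id so the labels can be decoded.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        result = metric.compute(
            predictions=decoded_preds,
            references=[[label] for label in decoded_labels],
        )
        return {"bleu": result["score"]}
    return compute_metrics
```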
When calling trainer.train(), I get:
The following columns in the training set don't have a corresponding argument in M2M100ForConditionalGeneration.forward and have been ignored: translation, id. If translation, id are not expected by M2M100ForConditionalGeneration.forward, you can safely ignore this message.
What does this message mean? Is my data format wrong? What data format does NLLB require?