Defining a custom dataset for fine-tuning translation

I’m a first-time user of the Hugging Face library. I am struggling to convert my custom dataset into one that can be used by the Hugging Face Trainer for a translation task with MBART-50. The languages I am trying to train on are part of the pre-trained model; I am simply trying to improve the model’s translation quality for that specific pair.

My data is in the form of two plaintext files, each containing the sentences in one of the languages (sentences at the same line number form a pair). I have tried using the CSV data loader, but I am unsure what the column names need to be so that the tokenizer can identify which column is the source and which the target; the trainer.train() call also needs to pick up on the same detail. There is also the issue that I’m trying to train the model to translate back and forth using the same dataset, and I am unsure how to accommodate that. In the simpletransformers library, with MT5, I could easily do it by specifying the task through a prefix column.
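
For reference, here is roughly how my data is laid out and how I currently pair it up (the file names are just placeholders):

with open("corpus.src", encoding="utf-8") as f:
    src_sentences = f.read().splitlines()
with open("corpus.tgt", encoding="utf-8") as f:
    tgt_sentences = f.read().splitlines()

# Sentences at the same line number form a translation pair.
pairs = list(zip(src_sentences, tgt_sentences))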

Can someone point me towards a tutorial on loading custom datasets for a translation task, similar to the tutorial given for loading a custom dataset for the sequence classification task? I assume the method is the same for any of the translation-oriented models, so any assistance would be appreciated. If such a tutorial exists for other tasks such as summarization and sentiment analysis, those would also be very helpful. I apologize if it is an overly trivial question.

This is exactly what I am trying to do too, with no luck yet. Please let me know if you have found a way to do this.

I am also struggling with this; it is not working for me. Can someone help me figure out what is going wrong with this code?

import torch

# `tokenizer` is the pretrained tokenizer loaded earlier (not shown in this snippet);
# as_target_tokenizer() switches it into target-language mode so the labels get
# the correct special tokens.
train_inputs = tokenizer(src_strs, return_tensors="pt", padding=True, truncation=True, max_length=64)
with tokenizer.as_target_tokenizer():
    train_tgt = tokenizer(tgt_strs, return_tensors="pt", padding=True, truncation=True, max_length=64)

val_inputs = tokenizer(val_src_strs, return_tensors="pt", padding=True, truncation=True, max_length=64)
with tokenizer.as_target_tokenizer():
    val_tgt = tokenizer(val_tgt_strs, return_tensors="pt", padding=True, truncation=True, max_length=64)

# Use the target token ids as the labels for the seq2seq model.
train_inputs['labels'] = train_tgt['input_ids']
val_inputs['labels'] = val_tgt['input_ids']

class PODataset(torch.utils.data.Dataset):
    def __init__(self, examples):
        self.examples = examples

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.examples.items()}

    def __len__(self):
        return len(self.examples['input_ids'])

train_dataset = PODataset(train_inputs)
val_dataset = PODataset(val_inputs)

You would need to write your own dataset loading script.
Please refer to: Writing a dataset loading script — datasets 1.9.0 documentation
You don’t have to register your dataset; just create a MyDataset.py script that looks similar to the one in the tutorial.

You can also take a look at the actual implementations of existing datasets: datasets/squad.py at master · huggingface/datasets · GitHub
What you’d need to do is define the Features in the _info() method for your data, and modify _generate_examples(), which holds the logic for loading the data.
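
For a rough idea, a minimal loading script for line-aligned translation files could look something like the sketch below (the file paths and the "en"/"xx" language codes are placeholders for your actual pair):

import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        # A single "translation" feature that holds both sides of the pair.
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"translation": datasets.features.Translation(languages=["en", "xx"])}
            )
        )

    def _split_generators(self, dl_manager):
        # Point the train split at the two parallel plaintext files.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"src_path": "train.en", "tgt_path": "train.xx"},
            )
        ]

    def _generate_examples(self, src_path, tgt_path):
        # Sentences at the same line number form a translation pair.
        with open(src_path, encoding="utf-8") as src_f, open(tgt_path, encoding="utf-8") as tgt_f:
            for idx, (src, tgt) in enumerate(zip(src_f, tgt_f)):
                yield idx, {"translation": {"en": src.strip(), "xx": tgt.strip()}}

You can then point load_dataset() at the script, e.g. load_dataset("path/to/MyDataset.py").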

Thanks for the reply. I’ll look into this mode of dataset generation. For now, I’ve taken the easy way out and parsed my data into a form that the run_translation.py script in the examples understands, saving it in the JSON Lines (jsonl) format.

For others hoping to use the same method, look into df.to_dict(orient='records') and json dumping.
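
To sketch the idea (the "en"/"xx" column names and file paths are placeholders): run_translation.py accepts a JSON Lines file where each line is an object with a "translation" field mapping language codes to sentences, which you can produce roughly like this:

import json
import pandas as pd

# Two parallel lists of sentences, one column per language.
df = pd.DataFrame({
    "en": open("corpus.en", encoding="utf-8").read().splitlines(),
    "xx": open("corpus.xx", encoding="utf-8").read().splitlines(),
})

# One {"translation": {...}} object per line, the format run_translation.py reads.
with open("train.json", "w", encoding="utf-8") as f:
    for record in df.to_dict(orient="records"):
        f.write(json.dumps({"translation": record}, ensure_ascii=False) + "\n")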
