Defining a custom dataset for fine-tuning translation

I’m a first-time user of the Hugging Face library. I am struggling to convert my custom dataset into one that can be used by the Hugging Face Trainer for a translation task with MBART-50. The languages I am trying to train on are part of the pre-trained model; I am simply trying to improve the model’s translation quality for that specific pair.

My data is in the form of two plaintext files, each containing the sentences in one of the languages (sentences at the same line number form a pair). I have tried using the CSV data loader, but I am unsure what the column names need to be so that the tokenizer can identify which column is the source and which the target; the trainer.train() call also needs to pick up on the same detail. There is also the issue that I’m trying to train the model to translate back and forth using the same dataset, and I am unsure how to accommodate that. In the simpletransformers library, with MT5, I could easily do it by specifying the task through a prefix column.
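
For reference, here is roughly how my data is laid out and how I currently pair it up (the file names are just placeholders):

with open("corpus.src", encoding="utf-8") as f:
    src_sentences = f.read().splitlines()
with open("corpus.tgt", encoding="utf-8") as f:
    tgt_sentences = f.read().splitlines()

# Sentences at the same line number form a translation pair.
pairs = list(zip(src_sentences, tgt_sentences))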

Can someone point me towards a tutorial on loading custom datasets for a translation task, similar to the tutorial given for loading a custom dataset for the sequence classification task? I assume the method is the same for any of the translation-oriented models, so any assistance would be appreciated. If such a tutorial exists for other tasks such as summarization and sentiment analysis, those would also be very helpful. I apologize if it is an overly trivial question.

This is exactly what I am trying to do too, with no luck yet. Please let me know if you have found a way to do this.

I am also struggling with this; it is not working for me. Can someone help me figure out what is going wrong with this code?

import torch

# `tokenizer` is the pretrained tokenizer loaded earlier (not shown in this snippet);
# as_target_tokenizer() switches it into target-language mode so the labels get
# the correct special tokens.
train_inputs = tokenizer(src_strs, return_tensors="pt", padding=True, truncation=True, max_length=64)
with tokenizer.as_target_tokenizer():
    train_tgt = tokenizer(tgt_strs, return_tensors="pt", padding=True, truncation=True, max_length=64)

val_inputs = tokenizer(val_src_strs, return_tensors="pt", padding=True, truncation=True, max_length=64)
with tokenizer.as_target_tokenizer():
    val_tgt = tokenizer(val_tgt_strs, return_tensors="pt", padding=True, truncation=True, max_length=64)

# Use the target token ids as the labels for the seq2seq model.
train_inputs['labels'] = train_tgt['input_ids']
val_inputs['labels'] = val_tgt['input_ids']

class PODataset(torch.utils.data.Dataset):
    def __init__(self, examples):
        self.examples = examples

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.examples.items()}

    def __len__(self):
        return len(self.examples['input_ids'])

train_dataset = PODataset(train_inputs)
val_dataset = PODataset(val_inputs)

You would need to write your own dataset loading script.
Please refer to: Writing a dataset loading script — datasets 1.9.0 documentation
You don’t have to register your dataset; just create a MyDataset.py script that looks similar to the one in the tutorial.

You can also take a look at the actual implementations of existing datasets: datasets/squad.py at master · huggingface/datasets · GitHub
What you’d need to do is define the Features in the _info() method for your data, and modify _generate_examples(), which holds the logic for loading the data.
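
For a rough idea, a minimal loading script for line-aligned translation files could look something like the sketch below (the file paths and the "en"/"xx" language codes are placeholders for your actual pair):

import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        # A single "translation" feature that holds both sides of the pair.
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"translation": datasets.features.Translation(languages=["en", "xx"])}
            )
        )

    def _split_generators(self, dl_manager):
        # Point the train split at the two parallel plaintext files.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"src_path": "train.en", "tgt_path": "train.xx"},
            )
        ]

    def _generate_examples(self, src_path, tgt_path):
        # Sentences at the same line number form a translation pair.
        with open(src_path, encoding="utf-8") as src_f, open(tgt_path, encoding="utf-8") as tgt_f:
            for idx, (src, tgt) in enumerate(zip(src_f, tgt_f)):
                yield idx, {"translation": {"en": src.strip(), "xx": tgt.strip()}}

You can then point load_dataset() at the script, e.g. load_dataset("path/to/MyDataset.py").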

Thanks for the reply. I’ll look into this mode of dataset generation. For now, I’ve taken the easy way out and parsed my data into a form that the run_translation.py script in the examples understands, saving it in the JSON Lines (jsonl) format.

For others hoping to use the same method, look into df.to_dict(orient='records') and json dumping.
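
To sketch the idea (the "en"/"xx" column names and file paths are placeholders): run_translation.py accepts a JSON Lines file where each line is an object with a "translation" field mapping language codes to sentences, which you can produce roughly like this:

import json
import pandas as pd

# Two parallel lists of sentences, one column per language.
df = pd.DataFrame({
    "en": open("corpus.en", encoding="utf-8").read().splitlines(),
    "xx": open("corpus.xx", encoding="utf-8").read().splitlines(),
})

# One {"translation": {...}} object per line, the format run_translation.py reads.
with open("train.json", "w", encoding="utf-8") as f:
    for record in df.to_dict(orient="records"):
        f.write(json.dumps({"translation": record}, ensure_ascii=False) + "\n")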
