How to create a dataset for translation

Hello, I need to create a translation dataset from my own text corpus. The only guide I found on creating a dataset is Create a dataset,
but it is very brief and says nothing about how to build a dataset of parallel texts.

Thank you in advance

It depends on what exactly you need, but if you just create a folder like:

translation_dataset/
    train.jsonl
    eval.jsonl
    README.md

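Each `.jsonl` file holds one JSON object per line. A common convention for parallel text in Hugging Face translation datasets is a nested `translation` object keyed by language code. A minimal sketch of writing such a file, assuming English–German pairs (the language codes and sentences are just placeholders for your corpus):

```python
import json

# Hypothetical English–German pairs; replace with your own corpus.
pairs = [
    {"translation": {"en": "Hello, world.", "de": "Hallo, Welt."}},
    {"translation": {"en": "How are you?", "de": "Wie geht es dir?"}},
]

# Write one JSON object per line (JSON Lines format).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in pairs:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

`eval.jsonl` follows the same format, just with held-out pairs.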
you can then load the dataset with the Hugging Face datasets library. For example:

from datasets import load_dataset

path = "path/to/translation_dataset"
train_set = load_dataset(path, split="train")
# Files named eval.* are picked up as the "test" split by default.
eval_set = load_dataset(path, split="test")

Beyond that, you may need some pre-processing, depending on what your examples look like and which model you’re using.
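For instance, seq2seq fine-tuning usually wants flat source/target strings rather than the nested `translation` object. A sketch of that step in plain Python (with the `datasets` library you would pass a function like this to `Dataset.map`); the `en`/`de` language codes and field names `source`/`target` are assumptions, not a fixed API:

```python
def to_source_target(example, src_lang="en", tgt_lang="de"):
    """Flatten a nested translation example into source/target strings."""
    pair = example["translation"]
    return {"source": pair[src_lang], "target": pair[tgt_lang]}

example = {"translation": {"en": "Hello, world.", "de": "Hallo, Welt."}}
flat = to_source_target(example)
print(flat)  # {'source': 'Hello, world.', 'target': 'Hallo, Welt.'}
```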