Hello, I need to create a translation dataset from my text corpus. The only guide I found on creating a dataset, Create a dataset, is very brief, and it has no information on how to create a dataset of parallel texts.
Thank you in advance
It varies depending on what exactly you need, but if you just create a folder like:
```
translation_dataset/
├── train.jsonl
├── eval.jsonl
└── README.md
```
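For parallel texts, a common convention in Hugging Face translation datasets is one JSON object per line with a `translation` field that maps language codes to aligned sentences. A minimal sketch for writing such a file from two aligned lists (the `en`/`de` language codes, sentences, and file name here are just placeholders for your own corpus):

```python
import json

# Aligned parallel sentences; replace with your own corpus (illustrative examples).
source_sentences = ["Hello, world.", "How are you?"]
target_sentences = ["Hallo, Welt.", "Wie geht es dir?"]

# Write one JSON object per line, pairing each source sentence with its translation.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for src, tgt in zip(source_sentences, target_sentences):
        record = {"translation": {"en": src, "de": tgt}}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

You would write `eval.jsonl` the same way from your held-out sentence pairs.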
then you’ll be able to load the dataset using Hugging Face datasets. For example:
```python
from datasets import load_dataset

path = "path/to/translation_dataset"
train_set = load_dataset(path, split="train")
eval_set = load_dataset(path, split="test")  # files named "eval" are detected as the test split
```
Beyond that, there could be pre-processing you need to do depending on what your examples look like and what model you’re using.
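As one example of such pre-processing, many seq2seq models expect a plain source/target string pair rather than the nested `translation` dict, so you might flatten each example before tokenization. This is a sketch in plain Python (the T5-style task prefix and language codes are assumptions; the same function could be passed to `Dataset.map`):

```python
# Toy examples in the nested "translation" format (illustrative only).
raw_examples = [
    {"translation": {"en": "Hello, world.", "de": "Hallo, Welt."}},
    {"translation": {"en": "How are you?", "de": "Wie geht es dir?"}},
]

def flatten(example):
    # Prepend a task prefix to the source text; adjust to your model's expected input.
    return {
        "source": "translate English to German: " + example["translation"]["en"],
        "target": example["translation"]["de"],
    }

prepared = [flatten(ex) for ex in raw_examples]
```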