Hello, I need to create a translation dataset based on my text corpus. The only guide I could find is Create a dataset,
but it is very brief and there is no information on how to create a dataset of parallel texts.
Thank you in advance
It varies depending on what exactly you need, but if you just create a folder like:
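One common layout (a sketch on my part, not the only option) is a folder with one JSON Lines file per split, where each record is a `translation` dict keyed by language code. The folder name, language pair, and example sentences below are all placeholders:

```python
import json
import os

# Hypothetical layout: one JSON Lines file per split ("train.jsonl",
# "test.jsonl"), each line a {"translation": {...}} record keyed by
# language code. Swap "en"/"fr" for your own language pair.
os.makedirs("translation_dataset", exist_ok=True)

pairs = [
    {"translation": {"en": "Hello, world.", "fr": "Bonjour le monde."}},
    {"translation": {"en": "Thank you.", "fr": "Merci."}},
]

for split in ("train", "test"):
    with open(f"translation_dataset/{split}.jsonl", "w", encoding="utf-8") as f:
        for record in pairs:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because the file names match the split names, the splits can be picked up when the folder is loaded.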
then you’ll be able to load the dataset using the Hugging Face datasets library. For example:
from datasets import load_dataset

# Point at the folder containing your split files
path = "path/to/translation_dataset"
train_set = load_dataset(path, split="train")
eval_set = load_dataset(path, split="test")
Beyond that, there could be pre-processing you need to do depending on what your examples look like and what model you’re using.
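For instance, many translation models want a flat source/target pair rather than a nested dict. A minimal sketch, assuming each example looks like `{"translation": {"en": ..., "fr": ...}}` (the function name and language codes are illustrative, adapt them to your data):

```python
# Hypothetical preprocessing step: flatten the nested "translation"
# dict into separate "source" and "target" fields.
def to_source_target(example, src_lang="en", tgt_lang="fr"):
    pair = example["translation"]
    return {"source": pair[src_lang], "target": pair[tgt_lang]}

example = {"translation": {"en": "Thank you.", "fr": "Merci."}}
print(to_source_target(example))
```

A function like this can be applied across the whole dataset with `train_set.map(to_source_target)` before tokenization.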