Hello, I need to create a translation dataset from my text corpus. The only guide I found on creating a dataset, Create a dataset, is very brief, and it has no information on how to create a dataset of parallel texts.
Thank you in advance
It varies depending on what exactly you need, but if you just create a folder like:
```
translation_dataset/
├── train.jsonl
├── eval.jsonl
└── README.md
```
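For parallel texts, a common convention in Hugging Face translation datasets is one JSON object per line with a `translation` field that maps language codes to aligned sentences. A minimal sketch for writing such a file from two aligned lists (the `en`/`de` language codes, sentences, and file name here are just placeholders for your own corpus):

```python
import json

# Aligned parallel sentences; replace with your own corpus (illustrative examples).
source_sentences = ["Hello, world.", "How are you?"]
target_sentences = ["Hallo, Welt.", "Wie geht es dir?"]

# Write one JSON object per line, pairing each source sentence with its translation.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for src, tgt in zip(source_sentences, target_sentences):
        record = {"translation": {"en": src, "de": tgt}}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

You would write `eval.jsonl` the same way from your held-out sentence pairs.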
then you’ll be able to load the dataset using Hugging Face datasets. For example:
```python
from datasets import load_dataset

path = "path/to/translation_dataset"
train_set = load_dataset(path, split="train")
eval_set = load_dataset(path, split="test")  # files named "eval" are detected as the test split
```
Beyond that, there could be pre-processing you need to do depending on what your examples look like and what model you’re using.
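As one example of such pre-processing, many seq2seq models expect a plain source/target string pair rather than the nested `translation` dict, so you might flatten each example before tokenization. This is a sketch in plain Python (the T5-style task prefix and language codes are assumptions; the same function could be passed to `Dataset.map`):

```python
# Toy examples in the nested "translation" format (illustrative only).
raw_examples = [
    {"translation": {"en": "Hello, world.", "de": "Hallo, Welt."}},
    {"translation": {"en": "How are you?", "de": "Wie geht es dir?"}},
]

def flatten(example):
    # Prepend a task prefix to the source text; adjust to your model's expected input.
    return {
        "source": "translate English to German: " + example["translation"]["en"],
        "target": example["translation"]["de"],
    }

prepared = [flatten(ex) for ex in raw_examples]
```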