How to make a translation dataset

urassari · November 12, 2023, 9:29am

I’m REALLY new to this AI making thing, and I think making models is easy (it’s probably not). I have made a simple conlang and I want to make a dataset for it so I can make a translator. I don’t really know how AI works, so I’m guessing that you need at least 1K translated texts to make an at least kinda working dataset.

urassari · November 14, 2023, 7:34pm

I have tried it and… translating a lot of text is hard, and my conlang can’t have literal translations. It’s impossible. So whoops! All of my one-month search was for nothing, I guess. (: ⦚)

urassari · November 16, 2023, 9:29am

Wait, no, I changed my mind.

jkswin · November 18, 2023, 4:56pm

Hi urassari

I’m a linguist by background, so your problem sounds interesting. Ultimately it boils down to a machine translation task and will likely involve a sequence-to-sequence approach.

Quoting urassari: “I’m REALLY new to this AI making thing”

My first suggestion would be to have a look at the Hugging Face NLP course.

Whether 1,000 training examples is enough is hard to say, because performance depends on many things. To name a few:

The size and type of model that you are training (the course linked above should shed some light on this)
Whether or not your conlang is similar to an existing language that the model you are finetuning has been pretrained on (assuming you are finetuning and not training from scratch)
The amount of data you have, its quality, and ultimately how systematic your conlang is (the model needs patterns to learn from)

Assuming your data is in a parallel format where you have sentences in English paired with sentences in your conlang, there should be a direct example you can follow in the HF course. Will this yield perfect results? No. But it will get you to start understanding the challenges and considerations of the problem, as well as exposing you to Datasets, Transformers and the HuggingFace workflow as a whole.
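To make "parallel format" concrete, here is a minimal sketch of one way to store sentence pairs as a JSON Lines file, using only the Python standard library. The conlang sentences, the language code "xx", and the file name are made-up placeholders, and the nested "translation" layout is an assumption based on how translation datasets on the Hub are commonly structured; adjust it to whatever the example you follow expects.

```python
import json
from pathlib import Path

# Hypothetical placeholder pairs: the conlang side is invented for illustration.
pairs = [
    ("Hello.", "Sala."),
    ("I see the river.", "Mi vena tor luma."),
    ("The river is cold.", "Tor luma ke fira."),
]


def write_parallel_jsonl(pairs, path, src="en", tgt="xx"):
    """Write one JSON object per line: {"translation": {src: ..., tgt: ...}}."""
    with open(path, "w", encoding="utf-8") as f:
        for source_text, target_text in pairs:
            record = {"translation": {src: source_text, tgt: target_text}}
            # ensure_ascii=False keeps any non-ASCII conlang characters readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_parallel_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


out_path = Path("conlang_parallel.jsonl")  # assumed file name
write_parallel_jsonl(pairs, out_path)
records = read_parallel_jsonl(out_path)
print(len(records), records[0])
```

If you have the Datasets library installed, a file like this can be loaded with load_dataset("json", data_files="conlang_parallel.jsonl"). Because the conlang has no literal translations, pairing whole sentences by meaning rather than word by word is the natural unit here.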

On a side note, you’ll likely get usable results from one of the large chat/completion models that are currently so popular. They tend to be good ‘one-shot’ or ‘few-shot’ learners: given a handful of examples and a fresh sentence in a prompt, the model can apply the pattern in the examples to the fresh sentence on the spot, without updating the little knobs and dials inside the model through an explicit training process.
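As a rough illustration of the few-shot idea, the sketch below just assembles a prompt string from a handful of example pairs plus one new sentence. The function name, the prompt wording, and the conlang sentences are all hypothetical, and sending the prompt to a model is left out because that depends on which service or library you use.

```python
def build_few_shot_prompt(examples, sentence, src_name="English", tgt_name="Conlang"):
    """Build a plain-text prompt: an instruction, worked examples, then the new sentence."""
    lines = [f"Translate from {src_name} to {tgt_name}.", ""]
    for source_text, target_text in examples:
        lines.append(f"{src_name}: {source_text}")
        lines.append(f"{tgt_name}: {target_text}")
        lines.append("")  # blank line between examples
    # The prompt ends with an open target line for the model to complete.
    lines.append(f"{src_name}: {sentence}")
    lines.append(f"{tgt_name}:")
    return "\n".join(lines)


# Hypothetical placeholder pairs: the conlang side is invented for illustration.
examples = [
    ("I see the river.", "Mi vena tor luma."),
    ("The river is cold.", "Tor luma ke fira."),
]
prompt = build_few_shot_prompt(examples, "I see the cold river.")
print(prompt)
```

The model is never trained here: the examples only live in the prompt, so you can swap them freely and see how the output changes.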

Bottom line, though: the HF NLP material is excellent and you should definitely start there.

Best of luck!
