How to make a translation dataset

I’m REALLY new to this AI-making thing, and I think making models is easy (it’s probably not). I have made a simple conlang and I want to make a dataset for it so I can make a translator. I don’t really know how AI works, so I’m guessing that you will need at least 1K translations of text in order to make an at least kinda working dataset.


I have tried it and… translating a lot of text is hard, and my conlang can’t have literal translations. It’s impossible. So whoops! All of my 1 month of searching was for nothing, I guess. (:

Wait, no, I changed my mind.

Hi urassari :slight_smile:

As a linguist by background, I find your problem interesting. It ultimately boils down to a machine translation task and will likely involve a sequence-to-sequence approach.

> I’m REALLY new to this AI making thing,

My first suggestion would be to go have a look at the HuggingFace NLP course here.

Your estimate of 1,000 training examples is hard to assess, because performance will depend on many factors. To name a few:

  • The size and type of model that you are training (the course linked above should shed some light on this)
  • Whether or not your conlang is similar to an existing language that the model you are finetuning has been pretrained on (assuming you are finetuning and not training from scratch)
  • The amount of data you have, its quality, and ultimately how systematic this conlang of yours is (the model needs patterns to learn from)
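On the data-quality point, even simple hygiene helps before any training: dropping exact duplicates and holding out a validation set. Here’s a minimal sketch in plain Python; the `(english, conlang)` tuple format and the function name are just assumptions for illustration, so adapt them to however you actually store your corpus.

```python
import random

def clean_and_split(pairs, valid_fraction=0.1, seed=42):
    """Deduplicate English/conlang pairs and split off a validation set.

    `pairs` is a list of (english, conlang) tuples -- a hypothetical
    format, not anything the HF libraries require.
    """
    # Drop exact duplicate pairs while keeping the original order.
    seen, unique = set(), []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)

    # Shuffle reproducibly, then split off a validation slice.
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_valid = max(1, int(len(unique) * valid_fraction))
    return unique[n_valid:], unique[:n_valid]

# Toy corpus with an invented conlang; one pair is a duplicate.
pairs = [("the sun rises", "sola levat"),
         ("the sun rises", "sola levat"),   # duplicate -- will be dropped
         ("the moon sets", "luna kadat"),
         ("water flows", "aqua fluat")]
train, valid = clean_and_split(pairs, valid_fraction=0.25)
```

Keeping the validation split fixed (via the seed) matters more than usual with tiny datasets, since a single leaked duplicate can make the model look far better than it is.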

Assuming your data is in a parallel format where you have sentences in English paired with sentences in your conlang, there should be a direct example you can follow in the HF course. Will this yield perfect results? No. But it will get you to start understanding the challenges and considerations of the problem, as well as exposing you to Datasets, Transformers and the HuggingFace workflow as a whole.
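To make that parallel format concrete: the HF translation examples use a nested `{"translation": {"en": ..., "xx": ...}}` layout, and a JSON Lines file in that shape can be loaded with `load_dataset("json", ...)`. Here’s a sketch that writes such a file with only the standard library; the sentences and the `"cl"` language code for your conlang are made up for illustration.

```python
import json
from pathlib import Path

# Toy parallel corpus; "cl" is an invented code for the conlang.
pairs = [("hello friend", "salu amiko"),
         ("good morning", "bona mateno")]

out = Path("conlang_train.jsonl")
with out.open("w", encoding="utf-8") as f:
    for en, cl in pairs:
        # One JSON object per line, in the nested "translation" layout
        # used by the HF translation examples.
        f.write(json.dumps({"translation": {"en": en, "cl": cl}}) + "\n")

# Later, with the `datasets` library installed, you could load it with:
#   from datasets import load_dataset
#   ds = load_dataset("json", data_files="conlang_train.jsonl")
```

From there the course’s translation chapter shows how to tokenize both sides and feed them to a seq2seq model.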

On a side note, you’ll likely get some mileage out of one of the large chat/completion models that are currently so popular. They tend to be good ‘one-shot’ or ‘few-shot’ learners, meaning that given a handful of examples and a fresh sentence in a prompt, the model can apply what it learned from the examples to the fresh sentence at inference time, without updating the little knobs and dials inside the model through an explicit training process.
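A few-shot prompt is just your example pairs laid out in a consistent pattern, followed by the new sentence. This sketch assembles one as a plain string; the wording of the instruction line and the `English:`/`Conlang:` labels are one plausible template, not any model’s required format.

```python
def build_fewshot_prompt(examples, query):
    """Assemble a few-shot translation prompt for a chat/completion model.

    `examples` is a list of (english, conlang) pairs; `query` is the new
    English sentence you want translated.
    """
    lines = ["Translate English into my constructed language.", ""]
    for en, cl in examples:
        lines.append(f"English: {en}")
        lines.append(f"Conlang: {cl}")
        lines.append("")
    # End with an unfinished pair so the model completes the translation.
    lines.append(f"English: {query}")
    lines.append("Conlang:")
    return "\n".join(lines)

prompt = build_fewshot_prompt(
    [("the sun rises", "sola levat"), ("the moon sets", "luna kadat")],
    "the river flows",
)
```

The more systematic your conlang’s grammar, the more a model can generalize from a handful of examples like these.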

Bottom line, though: the HF NLP material is excellent and you should definitely start there :slight_smile:

Best of luck!