About training a tokenizer from scratch

Hi Hugging Face community,
I am trying to train a tokenizer from scratch for a machine translation task that translates from English to Khmer. As far as I understand, a tokenizer for machine translation needs to be able to encode English text and decode Khmer text.
I have a few questions to ask:
+) Can I train that tokenizer from scratch on the same data that I use for the translation task?
+) If I have a word segmentation tool for Khmer, how can I use it when training a BPE tokenizer?
+) Is there a way to use HuggingFace’s Trainer, skipping the pre-processing step, if I want to train on a dataset whose sentences have already been split into lists of words?
Thank you!

I don’t quite understand your last question. The whole purpose of BPE is that you do not work with a word list (which is prone to out-of-vocabulary issues when a word is not in your list); instead you use subword units that can compose a much larger effective vocabulary. You can learn how to train a tokenizer here.
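As a starting point, here is a minimal sketch of training a BPE tokenizer from scratch with the Hugging Face `tokenizers` library. The vocabulary size, special tokens, and the tiny in-memory corpus are illustrative assumptions; in practice you would iterate over your own English–Khmer parallel data:

```python
# Minimal sketch: train a BPE tokenizer from scratch with the
# Hugging Face `tokenizers` library. Vocab size, special tokens,
# and the toy corpus below are placeholder assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty BPE model with an unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=16000,  # illustrative; tune for your corpus
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)

# train_from_iterator accepts any iterator over raw text, so you can
# feed sentences from your translation dataset directly, without files.
corpus = ["hello world", "hello tokenizer", "world of subwords"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("hello world")
print(encoding.tokens)
```

If your Khmer text is already pre-segmented into words, one option is to join the segments with spaces so the `Whitespace` pre-tokenizer treats each segment as a word boundary before BPE merges are learned within it.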