About training a tokenizer from scratch

Hi Hugging Face community,
I am trying to train a tokenizer from scratch for a machine translation task that translates from English to Khmer. As far as I understand, a tokenizer for machine translation needs to be able to encode English text and decode Khmer text.
I have a few questions to ask:
+) Can I train that tokenizer from scratch on the same data that I use for the translation task?
+) If I have a word segmentation tool for Khmer, how can I use it when training a BPE tokenizer?
+) Is there a way to use HuggingFace’s Trainer, skipping the pre-processing step, if I want to train on a dataset whose sentences have already been split into lists of words?
Thank you!

I don’t quite understand your last question. The whole purpose of BPE is that you do not work with a word list (which is prone to out-of-vocabulary issues when a word is not in your list); instead you use subword units that can compose a much larger effective vocabulary. You can learn how to train a tokenizer here.
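As a starting point, here is a minimal sketch of training a BPE tokenizer from scratch with the Hugging Face `tokenizers` library. The vocabulary size, special tokens, and the tiny in-memory corpus are illustrative assumptions; in practice you would iterate over your own English–Khmer parallel data:

```python
# Minimal sketch: train a BPE tokenizer from scratch with the
# Hugging Face `tokenizers` library. Vocab size, special tokens,
# and the toy corpus below are placeholder assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty BPE model with an unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=16000,  # illustrative; tune for your corpus
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)

# train_from_iterator accepts any iterator over raw text, so you can
# feed sentences from your translation dataset directly, without files.
corpus = ["hello world", "hello tokenizer", "world of subwords"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("hello world")
print(encoding.tokens)
```

If your Khmer text is already pre-segmented into words, one option is to join the segments with spaces so the `Whitespace` pre-tokenizer treats each segment as a word boundary before BPE merges are learned within it.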