I’m wondering if there is an easy way to tweak the individual components of a tokenizer. Specifically, I’d like to implement a custom normalizer and post-processor. Just to provide some context, I’m trying to train a Danish tokenizer. Danish has a lot of compound nouns (e.g., the Danish translation…


Implementing custom tokenizer components (normalizers, processors)

🤗Tokenizers

saattrupdan November 30, 2021, 12:41pm 2

For anyone else looking, this can be done, and it’s answered in this question: “How to add additional custom pre-tokenization processing?” (🤗Tokenizers).
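For reference, here is a rough sketch of what that approach can look like with the Python `tokenizers` package, which lets a plain Python object be wrapped via `Normalizer.custom(...)` and `PreTokenizer.custom(...)`. The class names, the helper `digit_spans`, and the specific rules (lowercasing, splitting digit runs from other text) are made up for illustration and are not taken from the linked thread; treat it as a starting point, not as the recommended implementation.

```python
import re


def digit_spans(text):
    """Return (start, stop) spans alternating between digit and non-digit runs.

    The spans cover the whole string, so no characters are dropped.
    """
    return [m.span() for m in re.finditer(r"\d+|\D+", text)]


class LowercaseNormalizer:
    """Hypothetical custom normalizer: NFKC, lowercase, strip outer whitespace."""

    def normalize(self, normalized):
        # `normalized` is a tokenizers.NormalizedString and is edited in place.
        normalized.nfkc()
        normalized.lowercase()
        normalized.strip()


class DigitSplitPreTokenizer:
    """Hypothetical custom pre-tokenizer: separates digit runs from other text."""

    def _split(self, i, normalized_string):
        # Slice the NormalizedString (not a plain str copy) so that offsets
        # stay tied to the original input.
        text = str(normalized_string)
        return [normalized_string[start:stop] for start, stop in digit_spans(text)]

    def pre_tokenize(self, pretok):
        # `pretok` is a tokenizers.PreTokenizedString; split() calls _split
        # on each existing piece.
        pretok.split(self._split)


def build_tokenizer():
    """Wire the custom components into a Tokenizer (needs `tokenizers` installed)."""
    # Imported here so the classes above remain plain Python objects.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.normalizers import Normalizer
    from tokenizers.pre_tokenizers import PreTokenizer

    tok = Tokenizer(BPE())
    tok.normalizer = Normalizer.custom(LowercaseNormalizer())
    tok.pre_tokenizer = PreTokenizer.custom(DigitSplitPreTokenizer())
    return tok
```

To try it out, call `build_tokenizer()` and inspect `tok.normalizer.normalize_str(...)` and `tok.pre_tokenizer.pre_tokenize_str(...)` on a few strings. Two caveats, as far as I know: a tokenizer that contains custom Python components cannot be serialized with `tok.save(...)`, so the custom parts have to be re-attached after loading, and I am not aware of an equivalent `custom` hook for post-processors, where the built-in `TemplateProcessing` may be the closest configurable option.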

Related topics

Topic	Category	Replies	Views	Activity
How to add additional custom pre-tokenization processing?	🤗Tokenizers	6	5281	March 7, 2023
Custom PostProcessor?	🤗Tokenizers	0	932	November 10, 2022
What does `tokenizers.normalizer.normalize` do?	🤗Tokenizers	5	3604	October 12, 2020
Tokenizer post_processor help	🤗Tokenizers	1	1401	October 27, 2022
How to see contents of a normalizer	🤗Tokenizers	0	306	May 7, 2021