Hi!
I am trying to implement a neural machine translation model from scratch, and I am now choosing a tokenizer for my languages.
I read that tokenizers have special Unicode normalizers (NFC, NFD, etc.). I have a few questions about these normalizers:
- Do I need to use a Unicode normalizer for German, or would any other normalizer be suitable?
- Is there any additional information on Unicode normalizers?
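
To show what I mean, here is a minimal sketch of how I understand the difference between NFC and NFD for a German umlaut. It uses only Python's standard `unicodedata` module, not any tokenizer library, and the example word and variable names are my own choices:

```python
import unicodedata

# "für" written with the precomposed character ü (U+00FC).
word = "f\u00fcr"

# NFC keeps (or builds) precomposed characters.
nfc = unicodedata.normalize("NFC", word)

# NFD splits ü into "u" plus the combining diaeresis (U+0308).
nfd = unicodedata.normalize("NFD", word)

print(len(nfc))    # number of code points in the NFC form
print(len(nfd))    # number of code points in the NFD form
print(nfc == nfd)  # the two forms look the same but compare unequal

# Normalizing the NFD form back to NFC recovers the precomposed string.
print(unicodedata.normalize("NFC", nfd) == nfc)
```

If this is right, the same German word could end up as different token sequences depending on how the input text was encoded, unless one form is applied consistently.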
Thanks a lot for your help!