Hey all! Loving the updated tokenizer docs and playing around with normalizers at the moment. I'd like to update my article here about text preprocessing and using Datasets, but I had a quick question:
`.normalize_str` works like so (with, say, an NFD + StripAccents sequence as in the docs):

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

normalizer = normalizers.Sequence([NFD(), StripAccents()])
normalizer.normalize_str("Héllò hôw are ü?")  # "Hello how are u?"
```
`normalizer.normalize` doesn't seem to be documented. Is this something I should maybe be using, or is it more for internal use?
Just wondering if `normalizer.normalize_str` is the most efficient thing to call inside `datasets.map`, or if `normalizer.normalize` can do some magic? Is there a way to use `datasets.map` to make things even faster?
Or, if I add a normalizer to a pretrained tokenizer and then call the tokenizer with Datasets, will that also carry out the normalization before doing the tokenization?
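With a from-scratch `tokenizers.Tokenizer` I can at least see that an attached normalizer runs as part of `encode()` — here's a toy sketch (made-up word-level vocab) — I'm just not sure whether the same holds when calling a pretrained tokenizer through `datasets.map`:

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFD, StripAccents, Lowercase
from tokenizers.pre_tokenizers import Whitespace

# Tiny vocab so the example runs offline; [UNK] catches anything else
vocab = {"hello": 0, "how": 1, "are": 2, "u": 3, "?": 4, "[UNK]": 5}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])

# encode() applies the normalizer first, then pre-tokenization and the model,
# so the accented input still maps onto the lowercase ASCII vocab
enc = tok.encode("Héllò hôw are ü?")
```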