What does `tokenizers.normalizer.normalize` do?

Hey all! Loving the updated tokenizer docs and playing around with normalizers at the moment. I’d like to update my article here about text preprocessing and using Datasets but I had a quick question:

I know .normalize_str works like so:

normalizer.normalize_str("Héllò hôw are ü?")
# "Hello how are u?"

But normalizer.normalize doesn’t seem to be documented? Is this something that maybe I should be using, or is more for internal use?

Just wondering if normalizer.normalize_str is the most efficient way to use a normalizer with datasets.map or if normalizer.normalize can do some magic? Is there a way to use batched=True within datasets.map to make things even faster?

Or if I added normalizer to a pretrained tokenizer and then call the tokenizer with datasets, will that also carry out the normalization before doing the tokenization?

The Normalizer.normalize function is for our internal use in the pipeline: it doesn’t take a plain string but a NormalizedString, which adds some functionality to keep track of the offsets with respect to the original text, and it works in place (so you can combine several normalizers easily).
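For the curious, here is a minimal sketch of what that looks like, assuming the NormalizedString type exposed by the Python bindings:

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

# NormalizedString wraps the text and tracks offsets back to the original
ns = NormalizedString("HÉLLO")

# normalize() mutates the NormalizedString in place instead of returning a str
Lowercase().normalize(ns)

print(ns.normalized)  # "héllo"
print(ns.original)    # "HÉLLO" — the original text is kept for offset mapping
```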

I don’t think normalize_str is less efficient. I also don’t think you can make things faster with batched=True here, since the mapped function will just iterate over the elements of each batch anyway.
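As a concrete sketch of using normalize_str with datasets.map (the normalizer sequence and the "text" column name here are just placeholders for your own setup):

```python
from tokenizers import normalizers

# A BERT-style normalization pipeline (placeholder — swap in your own)
normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.StripAccents(),
    normalizers.Lowercase(),
])

def normalize_example(example):
    # normalize_str takes and returns a plain Python string
    example["text"] = normalizer.normalize_str(example["text"])
    return example

# dataset = dataset.map(normalize_example)
print(normalize_example({"text": "Héllò hôw are ü?"}))
# {'text': 'hello how are u?'}
```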

Pinging @anthony and @Narsil that may have more insight


Yes that is exactly the way to go. To get maximum efficiency you want to avoid round trips between python and Rust as much as possible. We are in the process of getting a documentation out that will hopefully improve that.

Just a gist here; the de facto way to create a new Tokenizer in 0.9.0 is:

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, decoders, processors, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = ...  # some pre-tokenizer
tokenizer.decoder = ...
tokenizer.post_processor = ...
# etc.

# Optional training phase, not required
# if you loaded from an existing model.
trainer = trainers.BpeTrainer(vocab_size=80000)
tokenizer.train(trainer, ["myfile.txt"])

# Encoding
tokenizer.encode("my string").tokens

Then everything will run within Rust, which should get you the best speedup possible.
If you save your tokenizer with tokenizer.save('mytokenizer.json'), it will save the whole tokenization pipeline. It’s also a handy way of inspecting the various options of existing tokenizers.
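To illustrate the save/reload round trip (a sketch; the model and path are placeholders):

```python
import json
import os
import tempfile

from tokenizers import Tokenizer, models, normalizers

tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.NFKC()

path = os.path.join(tempfile.mkdtemp(), "mytokenizer.json")
tokenizer.save(path)

# The JSON file holds the whole pipeline, normalizer included —
# a quick way to inspect how an existing tokenizer is configured
with open(path) as f:
    config = json.load(f)
print(config["normalizer"])  # {'type': 'NFKC'}

# Reloading restores the full pipeline
reloaded = Tokenizer.from_file(path)
```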


Thanks @sgugger, @Narsil great to know.

@Narsil can I add a normalizer to a pretrained tokenizer and expect the same behaviour? Do pretrained tokenizers have their own normalizers that I might overwrite by accident?

Would something like the below work as expected or would I be interfering with what the pretrained tokenizer does?

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.normalizer = normalizers.NFKC()
tokenizer('Am I working?')

It would not work as is.

By default from_pretrained does not use (yet) the fast tokenizers (from the tokenizers library).

What you want to do is

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

# The actual tokenizer from `tokenizers` is `tokenizer._tokenizer`,
# as we still need a small Python wrapper currently
tokenizer._tokenizer.normalizer = normalizers.NFKC()
tokenizer.tokenize("Am I working?")

Keep in mind though that this works on master. On transformers 3.3.1 it would be tokenizer._tokenizer._tokenizer.normalizer.

If you want to add a new normalizer, you have to recreate the whole sequence (bear in mind that normalizer order matters, so carefully place the new one where it belongs for your use case).
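A possible sketch of rebuilding that sequence — note the BertNormalizer arguments here are an assumption about what bert-base-uncased uses (check the tokenizer’s saved JSON to be sure):

```python
from tokenizers import normalizers

# Rebuild the sequence: keep the pretrained normalizer first,
# then append the new one — order matters
new_normalizer = normalizers.Sequence([
    normalizers.BertNormalizer(lowercase=True),  # assumed pretrained defaults
    normalizers.NFKC(),
])

# Then assign it back, e.g.:
# tokenizer._tokenizer.normalizer = new_normalizer

print(new_normalizer.normalize_str("Héllò ①"))  # "hello 1"
```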

Ok, so one would have to go to the source code, find the normalizers used, and then replicate those? I guess this is a good starting point:

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

tokenizer._tokenizer.normalizer

Output: <tokenizers.normalizers.BertNormalizer at 0x7fc7e939b2f0>

Would it be worth exposing more information about the normalizer used with the pretrained tokenizers (e.g. with a .normalizer or .config attribute)?

tokenizer.add_normalizers would be ideal, but I guess it’s not going to be built for an audience of one :slightly_smiling_face:

Just trying to think about how to leverage the parallelization power of the tokenizer with easy, off-the-shelf processing functions, without folks having to go digging into the source code.

(cc @anthony)
