Use Unicode blocks in regex (in Replace normalizer)

In Python, I can use Unicode blocks in a regex to filter out characters that I don’t want. For example:

import regex as re

sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
print(re.sub(r"[^\p{InBasicLatin}\p{InLatin-1Supplement}]", "", sentence))
# -> 'This is ènglish !'

(See List of Unicode Groups and block ranges)


I’m trying to do the same for my tokenizer. I tried declaring a Replace normalizer, but it doesn’t seem to work:

from tokenizers.normalizers import Replace

sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
r = Replace(r"[^\p{InBasicLatin}\p{InLatin-1Supplement}]", "")

print(r.normalize_str(sentence))
# -> 'This is ènglish ㅕㅛㅇㄴㅁㅇㅇ!'

So my question is: how can I filter out characters that are not in these Unicode blocks, as I did in pure Python?

From what I can see in the GitHub repo, the Replace normalizer doesn’t do anything.
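One way to express the filter from the question without relying on block names is to spell the two blocks out as code point ranges: Basic Latin is U+0000 to U+007F and Latin-1 Supplement is U+0080 to U+00FF, so together they cover U+0000 to U+00FF. The sketch below uses only the standard library `re` module; the names `NON_LATIN1` and `keep_latin1` are made up for illustration, and this is a pre-processing workaround rather than a confirmed fix for the Replace normalizer.

```python
import re

# Basic Latin (U+0000-U+007F) plus Latin-1 Supplement (U+0080-U+00FF)
# form one contiguous range, so no \p{In...} block names are needed.
NON_LATIN1 = re.compile(r"[^\u0000-\u00FF]")


def keep_latin1(text: str) -> str:
    """Drop every character outside Basic Latin and Latin-1 Supplement."""
    return NON_LATIN1.sub("", text)


sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
# repr() makes the remaining whitespace visible.
print(repr(keep_latin1(sentence)))
```

The function can be applied to the text before it reaches the tokenizer. If the installed tokenizers release provides a `Regex` wrapper (I believe recent ones do, but treat that as an assumption), a range-based pattern like this one probably has to be wrapped in it before being passed to `Replace`, since a plain string is, as far as I can tell, matched literally rather than as a regex.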