In Python, I can use Unicode blocks in a regex to filter out characters that I don't want. For example:
import regex as re
sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
print(re.sub(r"[^\p{InBasicLatin}\p{InLatin-1Supplement}]", "", sentence))
# -> 'This is ènglish !'
(See List of Unicode Groups and block ranges)
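For reference, Basic Latin covers U+0000–U+007F and Latin-1 Supplement covers U+0080–U+00FF, so the same filter can also be written with an explicit codepoint range (the stdlib re module is enough here, since no block escapes are needed):

import re
sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
# The two blocks together form the contiguous range U+0000-U+00FF.
print(re.sub(r"[^\u0000-\u00FF]", "", sentence))
# -> 'This is ènglish  !'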
I'm trying to do the same for my tokenizer. I tried declaring a Replace normalizer, but it doesn't seem to work:
from tokenizers.normalizers import Replace
sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
r = Replace(r"[^\p{InBasicLatin}\p{InLatin-1Supplement}]", "")
print(r.normalize_str(sentence))
# -> 'This is ènglish ㅕㅛㅇㄴㅁㅇㅇ!'
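The normalized string comes back with the Hangul characters still in it. I suspect the plain string pattern is being matched literally rather than as a regular expression. Here is a minimal sketch of what I would expect to need instead, assuming Replace accepts a pattern wrapped in tokenizers.Regex, and using an explicit codepoint range in case the underlying engine does not recognize the block names (I have not confirmed this is the intended approach):

from tokenizers import Regex
from tokenizers.normalizers import Replace

sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
# Assumption: wrapping the pattern in Regex makes Replace treat it as a
# regular expression instead of a literal string. U+0000-U+00FF spans
# Basic Latin plus Latin-1 Supplement.
r = Replace(Regex(r"[^\u0000-\u00FF]"), "")
print(r.normalize_str(sentence))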
My question is therefore: how can I filter out characters that are not in these Unicode blocks, like I did in pure Python?