In Python, I can use Unicode blocks in a regex to filter out characters that I don't want. For example:
import regex as re
sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
print(re.sub(r"[^\p{InBasicLatin}\p{InLatin-1Supplement}]", "", sentence))
# -> 'This is ènglish !'
(See List of Unicode Groups and block ranges)
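For reference, Basic Latin covers U+0000–U+007F and Latin-1 Supplement covers U+0080–U+00FF, so the same filter can also be written with an explicit codepoint range (the stdlib re module is enough here, since no block escapes are needed):

import re
sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
# The two blocks together form the contiguous range U+0000-U+00FF.
print(re.sub(r"[^\u0000-\u00FF]", "", sentence))
# -> 'This is ènglish  !'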
I'm trying to do the same for my tokenizer. I tried declaring a Replace normalizer, but it doesn't seem to work:
from tokenizers.normalizers import Replace
sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
r = Replace(r"[^\p{InBasicLatin}\p{InLatin-1Supplement}]", "")
print(r.normalize_str(sentence))
# -> 'This is ènglish ㅕㅛㅇㄴㅁㅇㅇ!'
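The normalized string comes back with the Hangul characters still in it. I suspect the plain string pattern is being matched literally rather than as a regular expression. Here is a minimal sketch of what I would expect to need instead, assuming Replace accepts a pattern wrapped in tokenizers.Regex, and using an explicit codepoint range in case the underlying engine does not recognize the block names (I have not confirmed this is the intended approach):

from tokenizers import Regex
from tokenizers.normalizers import Replace

sentence = "This is ènglish ㅕㅛㅇㄴㅁㅇ ㅇ!"
# Assumption: wrapping the pattern in Regex makes Replace treat it as a
# regular expression instead of a literal string. U+0000-U+00FF spans
# Basic Latin plus Latin-1 Supplement.
r = Replace(Regex(r"[^\u0000-\u00FF]"), "")
print(r.normalize_str(sentence))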
My question is therefore: how can I filter out characters that are not in these Unicode blocks, like I did in pure Python?