I need to preprocess some sentences so that an existing module can split them into words based on single spaces. Currently, I have:
```python
import tokenizers.normalizers as tn
import tokenizers.pre_tokenizers as tp

normalizer = tn.Sequence([tn.NFD(), tn.StripAccents()])
pretokeniser = tp.Whitespace()  # behaves like WhitespaceSplit + Punctuation combined

def preprocess(line: str) -> str:
    pretokens = pretokeniser.pre_tokenize_str(normalizer.normalize_str(line))
    return " ".join([w for w, _ in pretokens])
```
The problem is that my language (Dutch) uses hyphens inside some compound words, and this process splits those compounds into separate words. For example, the sentence
Energie-efficiëntie, i.e. zuinig omgaan met stroomverbruik, wordt steeds belangrijker bij het trainen van transformer-architecturen – zoveel is zeker!
now becomes
Energie - efficientie , i . e . zuinig omgaan met stroomverbruik , wordt steeds belangrijker bij het trainen van transformer - architecturen – zoveel is zeker !
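As far as I can tell, the culprit is the punctuation half of `Whitespace`, which isolates the hyphen even when no whitespace surrounds it. A minimal reproduction with just the hyphenated compound:

```python
import tokenizers.pre_tokenizers as tp

# Whitespace pre-tokenizes on the pattern \w+|[^\w\s]+, so the hyphen
# ends up as its own pretoken even inside a compound word.
print(tp.Whitespace().pre_tokenize_str("transformer-architecturen"))
# → [('transformer', (0, 11)), ('-', (11, 12)), ('architecturen', (12, 25))]
```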
Is there a way to exclude the hyphen from being treated as a punctuation mark?
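For reference, the closest I have gotten is rebuilding the pre-tokeniser from `WhitespaceSplit` plus a custom `Split` whose character class deliberately leaves out `-`. This is only a sketch; I am not sure it reproduces everything `Whitespace` does (e.g. it isolates *runs* of punctuation like `!?` as one pretoken instead of splitting each character):

```python
import tokenizers.pre_tokenizers as tp
from tokenizers import Regex

# Sketch: split on whitespace first, then isolate runs of punctuation.
# The class [^\w\s-] excludes '-', so intra-word hyphens survive;
# other dashes such as the en-dash '–' are still isolated.
pretokeniser = tp.Sequence([
    tp.WhitespaceSplit(),
    tp.Split(Regex(r"[^\w\s-]+"), behavior="isolated"),
])

print([w for w, _ in pretokeniser.pre_tokenize_str("Energie-efficientie, zeker!")])
```

But this feels like reimplementing the library by hand, so I would prefer a built-in way if one exists.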