I’d like to use portions of the tokenizer pipeline (Normalizer, Pre-tokenizer) separately for some initial preprocessing/cleaning, run some external functions for additional preprocessing, and then hand the result back to (a new?) tokenizer pipeline. The overall flow would be roughly this (sketch after the list):
- normalizer
- pre-tokenizer
- custom (non-tokenizer pipeline) functions
- tokenizer.normalizer
- tokenizer.pre-tokenizer
- tokenizer.tokenize
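For concreteness, the last three steps are just the pipeline already attached to a Tokenizer object; here is a rough sketch of what I mean, using a pretrained checkpoint purely as a placeholder:

from tokenizers import Tokenizer

# Placeholder checkpoint, just to have a concrete pipeline to poke at.
tok = Tokenizer.from_pretrained("bert-base-uncased")

# The attached stages can be called on their own...
normed = tok.normalizer.normalize_str("Héllo  World")
pieces = tok.pre_tokenizer.pre_tokenize_str(normed)

# ...while encode() runs normalizer -> pre-tokenizer -> model in one pass.
encoding = tok.encode("Héllo  World")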
Is there a way to create a Tokenizer pipeline object that doesn’t tokenize? Or should I just do something like
from tokenizers import normalizers, pre_tokenizers

nzr = normalizers.Sequence(...)
ptok = pre_tokenizers.Sequence(...)

def custom_fn(text: str):
    # custom preprocessing
    ...
    return text

cleaned = custom_fn(
    ptok.pre_tokenize_str(
        nzr.normalize_str(text)
    )
)
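One thing I’m not sure about with that composition: as far as I can tell, normalize_str returns a plain string, but pre_tokenize_str returns (piece, offsets) pairs, so custom_fn would actually receive a list rather than a string. A minimal check (the component choices here are just placeholders):

from tokenizers import normalizers, pre_tokenizers

nzr = normalizers.Lowercase()       # placeholder normalizer
ptok = pre_tokenizers.Whitespace()  # placeholder pre-tokenizer

normed = nzr.normalize_str("Hello World")  # -> "hello world" (a str)
pieces = ptok.pre_tokenize_str(normed)     # -> [("hello", (0, 5)), ("world", (6, 11))]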
Further, if I hope to apply these to a Hugging Face Dataset, should I just map the functions over the dataset?
from datasets import load_dataset

my_ds = load_dataset(...)

nzr = normalizers.Sequence(...)
ptok = pre_tokenizers.Sequence(...)

my_ds = my_ds.map(nzr.normalize_str)
my_ds = my_ds.map(ptok.pre_tokenize_str)
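Since (if I understand correctly) Dataset.map passes each example as a dict of columns, I’m guessing I’d actually need a small wrapper rather than passing normalize_str directly. A sketch, assuming a "text" column (placeholder name) and the nzr/custom_fn defined above:

def clean_example(example):
    # Normalize and run the custom preprocessing on the (assumed) "text" column.
    example["text"] = custom_fn(nzr.normalize_str(example["text"]))
    return example

my_ds = my_ds.map(clean_example)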