Save tokenizer with argument

I am training my Hugging Face tokenizer on my own corpora, and I want to save it with a preprocessing step. That is, if I pass some text to it, I want it to apply the preprocessing and then tokenize the text, instead of me having to preprocess it explicitly beforehand. A good example is BERTweet (GitHub - VinAIResearch/BERTweet: A pre-trained language model for English Tweets, EMNLP-2020) and its tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True) (here normalization=True indicates that the input will be preprocessed according to some function). I want the same to apply when I train a tokenizer with a custom preprocessing function. My code is:

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

def preprocess(text):
    return text

paths = [str(x) for x in Path('data').glob('*.txt')]

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files=paths, vocab_size=50_000, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

tokenizer.save_model('CustomBertTokenizer')

Now, when I load the tokenizer:

from transformers import RobertaTokenizerFast
sentence = 'Hey'
tokenizer = RobertaTokenizerFast.from_pretrained('CustomBertTokenizer')
tokenizer(sentence)

I want the sentence to be preprocessed with the preprocess function and then tokenized. So I would like to pass an argument like preprocessing=True, or something along those lines. How can I do it?

Any ideas guys?

Hi @petarulev, what you want to do is:

Preprocess → Tokenization → Model

Then I'd suggest using HF Datasets for the preprocessing step, so the code could look like this:

# Suppose I have a dataset in .txt format, and its content is:
# 
# C1CCCC1\n
# C1NNCC1\n
# ...and so on

# Step1: load datasets with HF Datasets
from datasets import load_dataset
ds = load_dataset(
    "text", data_files={'train': ['path/to/data_1', ['path/to/data_2']}
)

# Step2: use .map() method
def preprocess(example):
    # Each line in the dataset file contains an unwanted trailing
    # newline '\n', so strip it here
    example['text'] = example['text'].rstrip()
    return example

processed_ds = ds.map(preprocess, num_proc=4) # set `num_proc` to speed up!

# Step3: tokenization
tokenizer = ...

This should be a way you can try. :hugs:
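
For completeness, Step 3 could look something like the sketch below. It assumes the tokenizer you trained was saved to 'CustomBertTokenizer' as in your snippet, and reuses the processed_ds and the 'text' column from the code above; the tokenize helper name is just for illustration.

# Step3 (sketch): load the trained tokenizer and tokenize the processed dataset
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('CustomBertTokenizer')

def tokenize(batch):
    # The 'text' column has already been preprocessed by .map() above
    return tokenizer(batch['text'], truncation=True)

tokenized_ds = processed_ds.map(tokenize, batched=True)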