Hi @petarulev, what you want to do is:
Preprocess → Tokenization → Model
To start, I’d suggest using HF Datasets, so the code would look like this:
# Suppose I have a dataset in .txt format, and the content is:
#
# C1CCCC1\n
# C1NNCC1\n
# ...and so on
# Step1: load datasets with HF Datasets
from datasets import load_dataset
ds = load_dataset(
"text", data_files={'train': ['path/to/data_1', ['path/to/data_2']}
)
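The paths above are just placeholders for your own files. Each line of the .txt becomes one example under a 'text' key, which you can sanity-check like this:
# `ds` is a DatasetDict with one split per key in `data_files`
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['text'],
#         num_rows: ...
#     })
# })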
# Step2: use .map() method
def preprocess(example):
    # Depending on loader options, each line may carry a trailing
    # newline '\n'; rstrip() removes it (and is a no-op otherwise)
    example['text'] = example['text'].rstrip()
    return example
processed_ds = ds.map(preprocess, num_proc=4) # set `num_proc` to speed up!
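A quick way to check that the mapping worked (the exact value shown depends on your data, of course):
print(processed_ds['train'][0])  # e.g. {'text': 'C1CCCC1'}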
# Step3: tokenization
tokenizer = ...
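Which tokenizer to use depends on the model you plan to train, so here is only a minimal sketch of how you might fill in that `...`: it assumes you load some pretrained tokenizer from the Hub (the checkpoint name below is a placeholder, not a recommendation) and tokenize with another .map() call:
from transformers import AutoTokenizer

# Placeholder checkpoint: replace with one that matches your model
tokenizer = AutoTokenizer.from_pretrained("your-checkpoint-here")

def tokenize(batch):
    # `truncation`/`max_length` are example kwargs; tune them for your model
    return tokenizer(batch['text'], truncation=True, max_length=128)

tokenized_ds = processed_ds.map(tokenize, batched=True)
With batched=True the tokenizer receives many lines at once, which is much faster than per-example calls.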
And that's the overall flow you can try.