Hi @petarulev, what you want to do is:
Preprocess → Tokenization → Model
To start, I’d suggest using HF Datasets, so the code would look like this:
# Suppose I have a dataset in .txt format, and the content is:
#
# C1CCCC1\n
# C1NNCC1\n
# ...and so on
# Step1: load datasets with HF Datasets
from datasets import load_dataset
ds = load_dataset(
"text", data_files={'train': ['path/to/data_1', ['path/to/data_2']}
)
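The paths above are just placeholders for your own files. Each line of the .txt becomes one example under a 'text' key, which you can sanity-check like this:
# `ds` is a DatasetDict with one split per key in `data_files`
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['text'],
#         num_rows: ...
#     })
# })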
# Step2: use .map() method
def preprocess(example):
    # Depending on loader options, each line may carry a trailing
    # newline '\n'; rstrip() removes it (and is a no-op otherwise)
    example['text'] = example['text'].rstrip()
    return example
processed_ds = ds.map(preprocess, num_proc=4) # set `num_proc` to speed up!
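A quick way to check that the mapping worked (the exact value shown depends on your data, of course):
print(processed_ds['train'][0])  # e.g. {'text': 'C1CCCC1'}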
# Step3: tokenization
tokenizer = ...
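Which tokenizer to use depends on the model you plan to train, so here is only a minimal sketch of how you might fill in that `...`: it assumes you load some pretrained tokenizer from the Hub (the checkpoint name below is a placeholder, not a recommendation) and tokenize with another .map() call:
from transformers import AutoTokenizer

# Placeholder checkpoint: replace with one that matches your model
tokenizer = AutoTokenizer.from_pretrained("your-checkpoint-here")

def tokenize(batch):
    # `truncation`/`max_length` are example kwargs; tune them for your model
    return tokenizer(batch['text'], truncation=True, max_length=128)

tokenized_ds = processed_ds.map(tokenize, batched=True)
With batched=True the tokenizer receives many lines at once, which is much faster than per-example calls.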
And that's the overall flow you can try.