I am tokenizing the English Wikipedia and BookCorpus datasets, concatenated into a single dataset for training GPT-2. Tokenizing each dataset on its own (i.e., not concatenated) is fast, but after concatenation the tokenization becomes extremely slow toward the end of the process. I am using the fast tokenizer option.