How to force LineByLineTextDataset to split a text corpus by words rather than characters

Based on the GitHub question "Simple WordLevelTokenizer · Issue #244 · huggingface/tokenizers", I’m trying to use a WordLevel tokenizer with a RoBERTa transformers model. My vocabulary contains numbers as strings, plus special tokens. I have a problem that I can localize, but I don’t know how to fix it. The situation is as follows:

tokenizer = RobertaTokenizerFast.from_pretrained("wordpiece", max_len=num_secs_max)
dataset = LineByLineTextDataset(

I see that LineByLineTextDataset splits numbers into separate digits, which is wrong for my use case. As far as I can tell, this is the result of the tokenizer.batch_encode_plus call. I found advice to add the is_split_into_words=True parameter when constructing RobertaTokenizerFast, but that did not help. Please explain how to split my corpus by words rather than by characters.
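
One direction that may be worth trying, offered as a sketch and not a confirmed fix: RobertaTokenizerFast is a byte-level BPE tokenizer, so it is possibly not applying a word-level vocabulary as a whole-word lookup. Building the tokenizer directly with the tokenizers library, using a WordLevel model and a whitespace pre-tokenizer, keeps each number as a single token. The vocabulary entries, special-token names, and sample line below are made up for illustration and are not taken from the setup above.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical vocabulary: RoBERTa-style special tokens plus numbers as strings.
vocab = {
    "<s>": 0,
    "<pad>": 1,
    "</s>": 2,
    "<unk>": 3,
    "<mask>": 4,
    "10": 5,
    "250": 6,
    "3": 7,
}

# WordLevel maps each pre-tokenized word straight to its id, with no subword splitting.
word_tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="<unk>"))

# The Whitespace pre-tokenizer splits on whitespace and punctuation,
# so a run of digits such as "250" stays together as one word.
word_tokenizer.pre_tokenizer = Whitespace()

encoding = word_tokenizer.encode("10 250 3")
print(encoding.tokens)
print(encoding.ids)

# A word missing from the vocabulary falls back to the unknown token.
unknown = word_tokenizer.encode("10 999")
print(unknown.tokens)
```

To pass such a tokenizer to LineByLineTextDataset, recent transformers versions allow wrapping it with PreTrainedTokenizerFast(tokenizer_object=word_tokenizer, ...) and setting the special tokens there. Whether that works depends on the installed version, so treat it as an assumption to verify.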