How to force LineByLineTextDataset split text corpus by words rather than symbols

Roman · August 27, 2021, 7:00pm

Following the GitHub issue "Simple WordLevelTokenizer" (Issue #244, huggingface/tokenizers), I'm trying to use a WordLevel tokenizer with a RoBERTa model from transformers. My vocabulary contains numbers as strings, plus special tokens. I've run into a problem and can localize where it goes wrong, but I don't know how to fix it. The situation is as follows:

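from transformers import LineByLineTextDataset, RobertaTokenizerFast

# "wordpiece" is the local directory holding my tokenizer files;
# num_secs_max is defined earlier in my script.
tokenizer = RobertaTokenizerFast.from_pretrained("wordpiece", max_len=num_secs_max)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./optiver.txt",
    block_size=128,
)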

I see that LineByLineTextDataset splits numbers into separate digits, which is wrong for my use case. As far as I can tell, this comes from the tokenizer.batch_encode_plus call. I found advice to pass is_split_into_words=True when constructing RobertaTokenizerFast, but that did not help. Could someone explain how to split my corpus by words rather than by individual symbols?
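For reference, here is a minimal sketch of the behavior being asked for: a WordLevel model from the tokenizers library with a whitespace-only pre-tokenizer, so each number stays one token. The toy vocabulary and the sample line are assumptions made up for illustration; they are not taken from the optiver.txt corpus or from the "wordpiece" directory above. A possible cause of the digit splitting, which the post does not confirm, is that RobertaTokenizerFast expects a byte-level BPE vocabulary with merges rather than a word-level one.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit

# Hypothetical toy vocabulary: RoBERTa-style special tokens plus numbers as strings.
vocab = {
    "<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "<mask>": 4,
    "12": 5, "345": 6, "7": 7,
}

# WordLevel maps each whole pre-token to one id; unknown words map to <unk>.
word_tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="<unk>"))

# Split on whitespace only, so "345" is never broken into "3", "4", "5".
word_tokenizer.pre_tokenizer = WhitespaceSplit()

# Hypothetical sample line; "999" is deliberately missing from the vocabulary.
encoding = word_tokenizer.encode("12 345 7 999")
print(encoding.tokens)
print(encoding.ids)
```

A tokenizer built this way can probably be handed to transformers by wrapping it in PreTrainedTokenizerFast(tokenizer_object=word_tokenizer, ...) with the special tokens passed by name, and that wrapper used in place of RobertaTokenizerFast when constructing LineByLineTextDataset. Whether this matches the files in the "wordpiece" directory is untested here.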
