Preprocessing data for custom tokenizer

antoine2323231 · October 21, 2022, 9:12am

I am training a custom tokenizer on data that is one long string. Can I pass it directly to a tokenizer model, and would it use the EOS token wherever there is a “.”, for example, so that it understands the sentence transitions? My question is: I do not need to split it into lines, correct?
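
For context, here is a minimal sketch of what "splitting" the long string could look like if it turns out to be needed. It uses only the Python standard library; the helper names `sentence_chunks` and `batched`, the sample text, and the choice to split after ".", "!" or "?" are my own assumptions for illustration, not anything prescribed by a tokenizer library. It does not add an EOS token by itself.

```python
import re
from typing import Iterator, List


def sentence_chunks(text: str) -> Iterator[str]:
    """Yield sentence-like pieces of one long string.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    for piece in re.split(r"(?<=[.!?])\s+", text):
        piece = piece.strip()
        if piece:
            yield piece


def batched(chunks: Iterator[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Group chunks into lists so the whole corpus is never handled at once."""
    batch: List[str] = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


# Hypothetical sample data standing in for the one long string.
long_text = "This is one sentence. Here is another one! And a third?"
batches = list(batched(sentence_chunks(long_text), batch_size=2))
```

With 🤗 Tokenizers, an iterator like `batched(sentence_chunks(long_text))` could presumably be handed to `Tokenizer.train_from_iterator`, which accepts an iterator of strings or of lists of strings, instead of passing the single long string.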
