I have a preprocessed dataset in which the tokens are already separated by whitespace, so I only need a very simple tokenizer to load it. Is there any advice on how to create one?
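For what it's worth, here is a minimal sketch of what such a tokenizer could look like in plain Python, assuming the vocabulary is built from the dataset itself and that unknown tokens should map to a single placeholder id. The class name `WhitespaceTokenizer`, its methods, and the `[PAD]`/`[UNK]` special tokens are all hypothetical choices made for illustration; this is not the API of any particular library.

```python
from collections import Counter


class WhitespaceTokenizer:
    """Hypothetical minimal tokenizer: split on whitespace, map tokens to ids."""

    def __init__(self, unk_token="[UNK]", pad_token="[PAD]"):
        # Assumption: two special tokens are reserved at ids 0 and 1.
        self.unk_token = unk_token
        self.pad_token = pad_token
        self.token_to_id = {pad_token: 0, unk_token: 1}
        self.id_to_token = [pad_token, unk_token]

    def fit(self, texts, min_freq=1):
        """Build the vocabulary from an iterable of whitespace-separated strings."""
        counts = Counter(tok for text in texts for tok in text.split())
        # Most frequent tokens first; ties broken alphabetically for determinism.
        for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if freq >= min_freq and tok not in self.token_to_id:
                self.token_to_id[tok] = len(self.id_to_token)
                self.id_to_token.append(tok)
        return self

    def encode(self, text):
        """Convert a string to a list of ids; unseen tokens get the unknown id."""
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(tok, unk_id) for tok in text.split()]

    def decode(self, ids):
        """Convert a list of ids back to a whitespace-joined string."""
        return " ".join(self.id_to_token[i] for i in ids)


tokenizer = WhitespaceTokenizer().fit(["the cat sat", "the dog sat"])
ids = tokenizer.encode("the cat sat")
print(ids, tokenizer.decode(ids))
```

Since the data is already pre-tokenized, `str.split()` with no arguments does all the splitting work. If the model expects a Hugging Face tokenizer object, the `tokenizers` library's `WordLevel` model combined with the `WhitespaceSplit` pre-tokenizer may be a closer fit than a hand-rolled class like this one.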