Do I have to use only single-word tokens in a BERT dataset for token classification?

wehapi · January 18, 2024, 9:11am

Hello, I’ve started fine-tuning BERT-cased, and so far I have only used datasets from Hugging Face.
Almost all of them follow the same structure: the text is split on spaces.
For example, the sentence "I am Raj Kumar"
becomes ['I','am','Raj','Kumar'].

Do I have to use single-word tokens like this, or can I use a dataset format where one token spans several words, such as ["I", "am", "Raj Kumar"], without splitting the text?

Basically, what I want to extract is human names, product names, and other data that will be two or more words long.
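To make the two formats concrete, here is a minimal sketch in plain Python of one common convention, BIO tagging, where the text stays split on spaces and a multi-word entity is marked with a B- tag on its first word and I- tags on the rest. The helper names `to_bio` and `merge_entities` and the `PER` label are hypothetical, chosen only for this illustration; they are not part of any Hugging Face API, and this is not necessarily how any particular dataset was built.

```python
def to_bio(chunks, labels):
    """Split multi-word chunks on spaces and assign BIO tags.

    `chunks` is a list like ["I", "am", "Raj Kumar"] and `labels` holds one
    label per chunk ("O" for non-entities). Both names are illustrative.
    """
    tokens, tags = [], []
    for chunk, label in zip(chunks, labels):
        for i, word in enumerate(chunk.split()):
            tokens.append(word)
            if label == "O":
                tags.append("O")
            else:
                # First word of an entity gets B-, following words get I-.
                tags.append(("B-" if i == 0 else "I-") + label)
    return tokens, tags


def merge_entities(tokens, tags):
    """Group B-/I- tagged tokens back into multi-word entity strings."""
    entities = []
    words, label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush the previous one if any.
            if words:
                entities.append((" ".join(words), label))
            words, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the entity currently being built.
            words.append(token)
        else:
            # "O" tag or a stray I- tag: close any open entity.
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


tokens, tags = to_bio(["I", "am", "Raj Kumar"], ["O", "O", "PER"])
print(tokens)
print(tags)
print(merge_entities(tokens, tags))
```

With this kind of scheme the dataset keeps one word per token, and names of two or more words are recovered afterwards by merging consecutive B-/I- predictions.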
