Does Hugging Face have a word tokenizer? I'm looking for one that can separate the English words in a string, without spaces or punctuation.
Thanks!
Hi, to my knowledge Hugging Face doesn't have a tokenizer like that, but you can do this with NLTK. For example:
from nltk.tokenize import word_tokenize
import string

# word_tokenize needs the Punkt models; download them once with:
# import nltk; nltk.download('punkt')
text = "This is a sentence. This is another sentence."  # avoid shadowing the built-in `input`
tokens = word_tokenize(text)
# = ['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

# Filter out the punctuation tokens
tokens = [token for token in tokens if token not in string.punctuation]
# = ['This', 'is', 'a', 'sentence', 'This', 'is', 'another', 'sentence']
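
If the string genuinely has no spaces at all (e.g. "thisisasentence"), word_tokenize can't recover the word boundaries on its own. One option is a small dynamic-programming segmenter over a word list. Here is a minimal sketch; using NLTK's words corpus as the vocabulary and the segment helper name are my own choices for illustration, not part of any Hugging Face or NLTK API:

from nltk.corpus import words  # requires: import nltk; nltk.download('words')

vocab = set(w.lower() for w in words.words())

def segment(s):
    # best[i] is a fewest-words segmentation of s[:i],
    # or None if s[:i] can't be split into vocabulary words
    best = [None] * (len(s) + 1)
    best[0] = []
    for i in range(1, len(s) + 1):
        for j in range(i):
            if best[j] is not None and s[j:i].lower() in vocab:
                candidate = best[j] + [s[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(s)]

print(segment("thisisasentence"))
# e.g. ['this', 'is', 'a', 'sentence'] (the exact split depends on the vocabulary)

The quality of the split depends entirely on the word list; dedicated packages like wordsegment tackle the same problem more robustly using corpus word frequencies.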