Word-based tokenizers

Does Hugging Face have a word-level tokenizer? I’m looking for one that can separate the English words in a string that has neither spaces nor punctuation.

Hi, to my knowledge Hugging Face doesn’t have a tokenizer like that, but you could do this using NLTK. For example:

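import string

from nltk.tokenize import word_tokenize

# Note: word_tokenize needs NLTK's tokenizer data, e.g. nltk.download("punkt");
# the exact resource name depends on the NLTK version.

# Named `text` rather than `input` to avoid shadowing the built-in input().
text = "This is a sentence. This is another sentence."

tokens = word_tokenize(text)
# tokens == ['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

# Filter out tokens made up entirely of punctuation characters
tokens = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
# tokens == ['This', 'is', 'a', 'sentence', 'This', 'is', 'another', 'sentence']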
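
Note that the NLTK example relies on the spaces and punctuation already present in the text. For a string with no spaces at all, as in the original question, one option is dictionary-based segmentation. Below is a minimal sketch of that idea using dynamic programming. It is not an NLTK or Hugging Face API: the `segment` function and the tiny `VOCAB` set are hypothetical and for illustration only, and a real vocabulary would have to come from a word list or corpus.

```python
# Hypothetical toy vocabulary; replace with a real English word list.
VOCAB = {"this", "is", "a", "sentence", "another"}


def segment(text, vocab=VOCAB):
    """Split `text` into words from `vocab`, or return None if no split exists.

    When several splits are possible, the one with the fewest words is kept.
    """
    text = text.lower()
    n = len(text)
    max_len = max((len(w) for w in vocab), default=0)

    # best[i] is a fewest-words segmentation of text[:i], or None if none exists.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


print(segment("thisisasentence"))  # ['this', 'is', 'a', 'sentence']
print(segment("xyz"))              # None
```

Preferring the fewest words is only a simple heuristic for ambiguous strings. With a realistic vocabulary you would probably want to score candidates by word frequency instead.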