Does Hugging Face have a word tokenizer? I'm looking for one that can separate the English words in a string, without spaces or punctuation.
Thanks!
Hi, to my knowledge Hugging Face doesn't have a tokenizer like that, but you can do this with NLTK. For example:
from nltk.tokenize import word_tokenize
import string

# word_tokenize needs the Punkt models; download them once with:
# import nltk; nltk.download('punkt')
text = "This is a sentence. This is another sentence."  # avoid shadowing the built-in `input`
tokens = word_tokenize(text)
# = ['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

# Filter out the punctuation tokens
tokens = [token for token in tokens if token not in string.punctuation]
# = ['This', 'is', 'a', 'sentence', 'This', 'is', 'another', 'sentence']
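
If the string genuinely has no spaces at all (e.g. "thisisasentence"), word_tokenize can't recover the word boundaries on its own. One option is a small dynamic-programming segmenter over a word list. Here is a minimal sketch; using NLTK's words corpus as the vocabulary and the segment helper name are my own choices for illustration, not part of any Hugging Face or NLTK API:

from nltk.corpus import words  # requires: import nltk; nltk.download('words')

vocab = set(w.lower() for w in words.words())

def segment(s):
    # best[i] is a fewest-words segmentation of s[:i],
    # or None if s[:i] can't be split into vocabulary words
    best = [None] * (len(s) + 1)
    best[0] = []
    for i in range(1, len(s) + 1):
        for j in range(i):
            if best[j] is not None and s[j:i].lower() in vocab:
                candidate = best[j] + [s[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(s)]

print(segment("thisisasentence"))
# e.g. ['this', 'is', 'a', 'sentence'] (the exact split depends on the vocabulary)

The quality of the split depends entirely on the word list; dedicated packages like wordsegment tackle the same problem more robustly using corpus word frequencies.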