Batch tokenize (split into tokens, without processing)

Sifal · October 28, 2023, 10:27am

Is there a function that does what tokenizer.tokenize('text') does, but on a batch? (i.e. it returns tokens rather than ids)

AIGeekProgrammer · October 28, 2023, 12:48pm

You mean like this?

tokens = tokenizer.tokenize("This is the extraction of tokens.")
print(tokens)
['This', 'is', 'the', 'extraction', 'of', 'token', '##s', '.']

Sifal · October 28, 2023, 1:17pm

Yes, but for a batch of sequences: something like tokenizer.batch_tokenize(sequences), which takes batch_size sequences
and returns a batch of tokenized sequences (not the ids, just the split tokens).

AIGeekProgrammer · October 28, 2023, 1:34pm

AFAIK, the tokenizer does not have a built-in method for processing a batch of sequences into tokens.
You can achieve this by using a list comprehension:

# A flat list of strings: tokenizer.tokenize expects one string per call
sequences = ["This is the extraction of tokens.",
             "This is the second sentence"]
tokenized_sequences = [tokenizer.tokenize(sequence) for sequence in sequences]
print(tokenized_sequences)

Sifal · October 28, 2023, 1:43pm

That’s what I’ve been using, and it has become a major bottleneck. I need the tokens themselves so I can pass them to another function.
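
One thing that may help with the bottleneck described above, offered as a sketch rather than a confirmed fix: if the tokenizer is a "fast" (Rust-backed) one, calling it on the whole list encodes the batch in a single call, and the returned BatchEncoding exposes the token strings for each item through tokens(i). The helper name batch_tokenize below is hypothetical (it is not a transformers method), and passing add_special_tokens=False is an assumption made so that the result is closer to what tokenizer.tokenize returns. For slow (pure Python) tokenizers, where tokens() is not available, the sketch falls back to the list comprehension.

```python
def batch_tokenize(tokenizer, texts, add_special_tokens=False):
    """Return one list of token strings per input text.

    Hypothetical helper, not part of transformers. Assumes `tokenizer`
    behaves like a Hugging Face tokenizer: callable on a list of strings,
    with an `is_fast` attribute and a `tokenize` method.
    """
    texts = list(texts)
    if not texts:
        return []

    # Slow tokenizers do not support BatchEncoding.tokens(), so fall back
    # to tokenizing one text at a time.
    if not getattr(tokenizer, "is_fast", False):
        return [tokenizer.tokenize(text) for text in texts]

    # Fast tokenizers encode the whole batch in one call.
    encoding = tokenizer(texts, add_special_tokens=add_special_tokens)
    # tokens(i) returns the token strings for the i-th item of the batch.
    return [encoding.tokens(i) for i in range(len(texts))]


# Example usage (requires transformers and a downloaded model; the
# checkpoint name is only an illustration):
#
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
#   print(batch_tokenize(tokenizer, ["This is the extraction of tokens.",
#                                    "This is the second sentence"]))
```

Whether this is actually faster depends on the tokenizer and the data, so it is worth timing both versions on a realistic batch before switching.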
