Batch tokenize (split into tokens, without processing)

Is there a function that does what tokenizer.tokenize('text') does, but on a batch (i.e., one that returns tokens rather than ids)?

You mean like this?

tokens = tokenizer.tokenize("This is the extraction of tokens.")
['This', 'is', 'the', 'extraction', 'of', 'token', '##s', '.']

Yes, but for a batch of sequences: something like tokenizer.batch_tokenize(sequences), where sequences holds batch_size texts,
that returns a batch of tokenized sequences (not the ids, just the split tokens).

AFAIK, the tokenizer does not have a built-in method for processing a batch of sequences into tokens.
You can achieve this by using a list comprehension:

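# A flat list of strings; a nested list would pass a whole list to tokenize()
sequences = ["This is the extraction of tokens.",
             "This is the second sentence"]
# One list of tokens per input sequence
tokenized_sequences = [tokenizer.tokenize(sequence) for sequence in sequences]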

That’s what I’ve been using, and it has become a major bottleneck. I need the tokens themselves so I can pass them to another function.
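
One thing that may be worth trying, assuming you are using a "fast" (Rust-backed) tokenizer: call the tokenizer once on the whole list and read the token strings back from the returned BatchEncoding with its tokens(i) method, instead of calling tokenize() once per sequence. Below is a minimal sketch of that idea. The helper name batch_tokenize is made up for illustration and is not part of the library, and whether this actually removes the bottleneck depends on your tokenizer and data, so it is worth timing on your own batch.

```python
from typing import Any, List


def batch_tokenize(tokenizer: Any, sequences: List[str]) -> List[List[str]]:
    """Return the token strings for every sequence using one batched call.

    Hypothetical helper. Assumes `tokenizer` is a Hugging Face *fast*
    tokenizer: calling it on a list of strings returns a BatchEncoding,
    and BatchEncoding.tokens(i) gives the tokens of the i-th item.
    """
    if not sequences:
        return []
    # add_special_tokens=False mirrors tokenizer.tokenize(), which by default
    # does not add markers such as [CLS] / [SEP].
    encoding = tokenizer(sequences, add_special_tokens=False)
    return [encoding.tokens(i) for i in range(len(sequences))]


# Usage (assumes the transformers package is installed; the checkpoint name
# is only an example and was not given in this thread):
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
#   tokenized_sequences = batch_tokenize(
#       tokenizer,
#       ["This is the extraction of tokens.", "This is the second sentence"],
#   )
```

Note that tokens(i) only works on the output of a fast tokenizer; with a slow (pure Python) tokenizer the list comprehension above is still the way to go.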