Is there a function that does tokenizer.tokenize('text') except on a batch? (i.e. one that returns tokens rather than IDs)
You mean like this?
tokens = tokenizer.tokenize("This is the extraction of tokens.")
print(tokens)
['This', 'is', 'the', 'extraction', 'of', 'token', '##s', '.']
Yes, but for a batch of sequences, something like tokenizer.batch_tokenize(batch_size * sequences) that returns a batch of tokenized sequences (not the IDs, just the split tokens).
AFAIK, the tokenizer does not have a built-in method for processing a batch of sequences into tokens.
You can achieve this by using a list comprehension:
sequences = ["This is the extraction of tokens.", "This is the second sentence."]
tokenized_sequences = [tokenizer.tokenize(sequence) for sequence in sequences]
print(tokenized_sequences)
That’s what I’ve been using, and it’s been causing major bottlenecks; I need the tokens so I can pass them to another function.
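One way around the Python-loop overhead is to do the tokenization in a single batched call on a fast (Rust-backed) tokenizer and read the split tokens off each encoding. Below is a minimal sketch using the `tokenizers` library directly; the tiny `WordLevel` vocab is purely illustrative (an assumption, not your model), and in practice you would load your pretrained fast tokenizer instead. The key pieces, `Tokenizer.encode_batch` and `Encoding.tokens`, give you the token strings rather than IDs:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Illustrative in-memory vocab; a real setup would load a pretrained
# fast tokenizer rather than building one by hand.
vocab = {"[UNK]": 0, "this": 1, "is": 2, "a": 3, "test": 4,
         "second": 5, "sentence": 6}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

sequences = ["this is a test", "a second sentence"]

# encode_batch tokenizes the whole batch in the Rust backend,
# avoiding a per-sequence Python loop.
encodings = tok.encode_batch(sequences)

# Each Encoding exposes the split token strings (not the IDs).
tokens = [enc.tokens for enc in encodings]
print(tokens)
```

With a `transformers` fast tokenizer the same idea applies: call the tokenizer once on the whole list and read the tokens per sequence from the returned batch, keeping the hot loop out of Python.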