TokenClassification pipeline doing batch processing over a sequence of already tokenised messages

Is batch processing with the TokenClassification pipeline supported?

I have a fine-tuned model which performs token classification, and a tokenizer which was built as:

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

and this works fine in a pipeline when processing a single document/message:

nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="first",  # max, none, simple, average
    binary_output=True,
    ignore_labels=[],
)

text = ["Hello", "this", "is", "a", "single", "tokenized", "message"]

for token in nlp(text):
    print(token)

[{'entity_group': 'LABEL_2', 'score': 0.07955505, 'word': 'Hello', 'start': 0, 'end': 5}]
[{'entity_group': 'LABEL_2', 'score': 0.06315145, 'word': 'this', 'start': 0, 'end': 4}]
[{'entity_group': 'LABEL_2', 'score': 0.08200004, 'word': 'is', 'start': 0, 'end': 2}]
[{'entity_group': 'LABEL_2', 'score': 0.07786057, 'word': 'a', 'start': 0, 'end': 1}]
[{'entity_group': 'LABEL_3', 'score': 0.056751117, 'word': 'single', 'start': 0, 'end': 6}]
[{'entity_group': 'LABEL_3', 'score': 0.10323574, 'word': 'tokenized', 'start': 0, 'end': 9}]
[{'entity_group': 'LABEL_3', 'score': 0.09412522, 'word': 'message', 'start': 0, 'end': 7}]

If I now try to pass a sequence of messages:

text = [
    ["Hello", "this", "is", "a", "single", "tokenized", "message"], 
    ["another", "tokenized", "message"], 
    ["short", "message"]
]

I was expecting to be able to do something like this:

for msg in nlp(text):
    for entity in msg:
        print(entity)

but I always end up with a ValueError:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
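For reference, here is a minimal sketch of why the batch fails to tensorise (my own reproduction attempt, not the pipeline's actual internals): without padding, the encoded sequences have different lengths, and padding/truncation are call-time arguments of the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

batch = ["another tokenized message", "short message"]

# Without padding, input_ids are ragged lists of different lengths,
# so they cannot be stacked into a single tensor:
# tokenizer(batch, return_tensors="pt")  # raises the same ValueError

# Padding (and truncation) at call time makes the batch rectangular:
encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)  # torch.Size([2, <max length in batch>])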

even if I initialise the Pipeline like this:

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-cased",
    model_max_length=512,
    is_split_into_words=True,
    padding=True,
    truncation=True,
)

nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="first",
    binary_output=True,
    ignore_labels=[],
)

I get the same ValueError.

So how can I use the TokenClassification pipeline to do batch processing over a sequence of already tokenised messages?

I’m starting to read the pipeline code to understand what is going on, but before I dig deeper, can anyone tell me whether this is even possible: having the pipeline do batch processing AND handle the padding/truncation for already tokenised input?

Thanks!
David

I think the problem here is that the pipeline doesn’t expect the input to be split into words, but rather each full sentence as a single string. You can then pass a list of strings, where each string is a sentence. In other words, when you pass one sentence split into words, the pipeline assumes you are passing N sentences of one word each.
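If that is the case, a possible workaround (a sketch, untested against your model) would be to join each pre-tokenised message back into a single string before calling the pipeline. Note that joining with spaces changes the character offsets (start/end) relative to your original words:

messages = [
    ["Hello", "this", "is", "a", "single", "tokenized", "message"],
    ["another", "tokenized", "message"],
    ["short", "message"],
]

# One plain string per message, which is what the pipeline expects:
texts = [" ".join(words) for words in messages]

for entities in nlp(texts):  # one list of entities per message
    for entity in entities:
        print(entity)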

Now I’d like to pass the input sentences to the pipeline (TokenClassificationPipeline) already split into words, like we do when calling the raw tokenizer with is_split_into_words=True. Is that even possible? Can anybody help?
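One way around it, I think, is to skip the pipeline and call the tokenizer and model directly, since the tokenizer itself does support is_split_into_words. This is only a sketch that mimics aggregation_strategy="first" by taking the label of each word’s first sub-token; it assumes model is your fine-tuned token-classification model and a fast tokenizer (needed for word_ids):

import torch

encoded = tokenizer(
    text,                        # the list of word lists from above
    is_split_into_words=True,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**encoded).logits        # (batch, seq_len, num_labels)
predictions = logits.argmax(dim=-1)

for i, words in enumerate(text):
    word_ids = encoded.word_ids(batch_index=i)  # sub-token -> word index
    seen = set()
    for token_idx, word_id in enumerate(word_ids):
        # Skip special tokens (word_id is None) and all but the first
        # sub-token of each word:
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        label = model.config.id2label[predictions[i, token_idx].item()]
        print(words[word_id], label)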