TokenClassification pipeline doing batch processing over a sequence of already tokenised messages

Is batch processing with the TokenClassification pipeline supported?

I have a fine-tuned model which performs token classification, and a tokenizer which was built as:

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

and this works fine in a pipeline when processing a single document/message:

nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="first",  # max, none, simple, average
    binary_output=True,
    ignore_labels=[],
)

text = ["Hello", "this", "is", "a", "single", "tokenized", "message"]

for token in nlp(text):
    print(token)

[{'entity_group': 'LABEL_2', 'score': 0.07955505, 'word': 'Hello', 'start': 0, 'end': 5}]
[{'entity_group': 'LABEL_2', 'score': 0.06315145, 'word': 'this', 'start': 0, 'end': 4}]
[{'entity_group': 'LABEL_2', 'score': 0.08200004, 'word': 'is', 'start': 0, 'end': 2}]
[{'entity_group': 'LABEL_2', 'score': 0.07786057, 'word': 'a', 'start': 0, 'end': 1}]
[{'entity_group': 'LABEL_3', 'score': 0.056751117, 'word': 'single', 'start': 0, 'end': 6}]
[{'entity_group': 'LABEL_3', 'score': 0.10323574, 'word': 'tokenized', 'start': 0, 'end': 9}]
[{'entity_group': 'LABEL_3', 'score': 0.09412522, 'word': 'message', 'start': 0, 'end': 7}]

If I now try to pass a sequence of messages:

text = [
    ["Hello", "this", "is", "a", "single", "tokenized", "message"], 
    ["another", "tokenized", "message"], 
    ["short", "message"]
]

I was expecting to be able to do something like this:

for msg in nlp(text):
    for entity in msg:
        print(entity)

but I always end up with a ValueError:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
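For reference, here is a minimal sketch of why the batch fails to tensorise (my own reproduction attempt, not the pipeline's actual internals): without padding, the encoded sequences have different lengths, and padding/truncation are call-time arguments of the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

batch = ["another tokenized message", "short message"]

# Without padding, input_ids are ragged lists of different lengths,
# so they cannot be stacked into a single tensor:
# tokenizer(batch, return_tensors="pt")  # raises the same ValueError

# Padding (and truncation) at call time makes the batch rectangular:
encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)  # torch.Size([2, <max length in batch>])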

even if I initialise the Pipeline like this:

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-cased",
    model_max_length=512,
    is_split_into_words=True,
    padding=True,
    truncation=True,
)

nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="first",
    binary_output=True,
    ignore_labels=[],
)

I get the same ValueError.

So how can I use the TokenClassification pipeline to do batch processing over a sequence of already tokenised messages?

I’m starting to read the pipeline code to understand what is going on, but before I dig deeper, can anyone tell me whether this is even possible: having the pipeline do batch processing AND handle the padding/truncation for already tokenised input?

Thanks!
David

I think the problem here is that the pipeline doesn’t expect the input to be split into words, but rather each full sentence as a single string. You can then pass a list of strings, where each string is a sentence. In other words, when you pass one sentence split into words, the pipeline assumes you are passing N sentences of one word each.
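If that is the case, a possible workaround (a sketch, untested against your model) would be to join each pre-tokenised message back into a single string before calling the pipeline. Note that joining with spaces changes the character offsets (start/end) relative to your original words:

messages = [
    ["Hello", "this", "is", "a", "single", "tokenized", "message"],
    ["another", "tokenized", "message"],
    ["short", "message"],
]

# One plain string per message, which is what the pipeline expects:
texts = [" ".join(words) for words in messages]

for entities in nlp(texts):  # one list of entities per message
    for entity in entities:
        print(entity)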

Now I’d like to pass the input sentences to the pipeline (TokenClassificationPipeline) already split into words, like we do when calling the raw tokenizer with is_split_into_words=True. Is that even possible? Can anybody help?
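One way around it, I think, is to skip the pipeline and call the tokenizer and model directly, since the tokenizer itself does support is_split_into_words. This is only a sketch that mimics aggregation_strategy="first" by taking the label of each word’s first sub-token; it assumes model is your fine-tuned token-classification model and a fast tokenizer (needed for word_ids):

import torch

encoded = tokenizer(
    text,                        # the list of word lists from above
    is_split_into_words=True,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**encoded).logits        # (batch, seq_len, num_labels)
predictions = logits.argmax(dim=-1)

for i, words in enumerate(text):
    word_ids = encoded.word_ids(batch_index=i)  # sub-token -> word index
    seen = set()
    for token_idx, word_id in enumerate(word_ids):
        # Skip special tokens (word_id is None) and all but the first
        # sub-token of each word:
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        label = model.config.id2label[predictions[i, token_idx].item()]
        print(words[word_id], label)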