Is batch processing with the TokenClassification pipeline supported?
I have a fine-tuned model that performs token classification, and a tokenizer that was built as:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
and this works fine in a pipeline when processing a single document/message:
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="first",  # or: max, none, simple, average
    binary_output=True,
    ignore_labels=[],
)
text = ["Hello", "this", "is", "a", "single", "tokenized", "message"]
for token in nlp(text):
    print(token)
[{'entity_group': 'LABEL_2', 'score': 0.07955505, 'word': 'Hello', 'start': 0, 'end': 5}]
[{'entity_group': 'LABEL_2', 'score': 0.06315145, 'word': 'this', 'start': 0, 'end': 4}]
[{'entity_group': 'LABEL_2', 'score': 0.08200004, 'word': 'is', 'start': 0, 'end': 2}]
[{'entity_group': 'LABEL_2', 'score': 0.07786057, 'word': 'a', 'start': 0, 'end': 1}]
[{'entity_group': 'LABEL_3', 'score': 0.056751117, 'word': 'single', 'start': 0, 'end': 6}]
[{'entity_group': 'LABEL_3', 'score': 0.10323574, 'word': 'tokenized', 'start': 0, 'end': 9}]
[{'entity_group': 'LABEL_3', 'score': 0.09412522, 'word': 'message', 'start': 0, 'end': 7}]
If I now try to pass a sequence of messages:
text = [
    ["Hello", "this", "is", "a", "single", "tokenized", "message"],
    ["another", "tokenized", "message"],
    ["short", "message"],
]
I was expecting to be able to do something like this:
for msg_entities in nlp(text):
    for entity in msg_entities:
        print(entity)
but I always end up with a ValueError:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
even if I initialise the pipeline like this:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased", model_max_length=512, is_split_into_words=True, padding=True, truncation=True)
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="first",
    binary_output=True,
    ignore_labels=[],
)
I get the same ValueError.
So how can I use the TokenClassification pipeline to do batch processing over a sequence of already tokenised messages?
I’m starting to read the pipeline code to understand it, but before I dig deeper, can anyone tell me whether this is even possible: having the pipeline do batch processing AND handle padding/truncation for already-tokenised input?
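In case it helps frame the question, the workaround I'm currently sketching is to join each pre-tokenised message back into a plain string (so the pipeline can batch and pad a list of strings), then map the character offsets in the pipeline output back onto my original tokens. This is only a minimal sketch under the assumption that tokens can be rejoined with single spaces; the helpers `join_tokens` and `char_span_to_token` are my own, not part of transformers:

```python
def join_tokens(tokens):
    """Join a pre-tokenised message into a single string,
    recording the (start, end) character span of each token."""
    spans, pos = [], 0
    for tok in tokens:
        start = pos
        end = start + len(tok)
        spans.append((start, end))
        pos = end + 1  # account for the single joining space
    return " ".join(tokens), spans


def char_span_to_token(spans, start, end):
    """Map a character span from the pipeline output back to the
    index of the first original token it overlaps, or None."""
    for i, (s, e) in enumerate(spans):
        if start < e and end > s:  # any overlap counts
            return i
    return None


# Each message becomes one plain string the pipeline can batch and pad.
messages = [
    ["Hello", "this", "is", "a", "single", "tokenized", "message"],
    ["another", "tokenized", "message"],
    ["short", "message"],
]
joined = [join_tokens(m) for m in messages]
texts = [text for text, _ in joined]
# The idea would then be to call nlp(texts, batch_size=8) and map each
# entity's 'start'/'end' back with char_span_to_token(spans, start, end).
```

I haven't verified whether this round-trips cleanly for aggregated entities spanning multiple tokens, which is partly why I'm asking whether the pipeline supports pre-tokenised batches directly.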
Thanks!
David