How to use pipeline for 'token-classification' with already tokenized input?

mpost · February 3, 2022, 4:13pm

I have fine-tuned a BERT-based model for ‘token-classification’ and would like to run inference with it. From what I gathered, pipeline() would be a good fit here.

However, it requires not only the model but a tokenizer as well. My data is already tokenized into a list of words, and I would like to feed it to the model as is. Providing a tokenizer based on the model, however, splits the words in unexpected ways. Something like a “don’t do anything” tokenizer would be required.

Here is the code I am using, where I concatenate the entire list of words and let the tokenizer do the splitting again:

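from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Use a distinct name so the imported pipeline() function is not shadowed
token_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
# Join the pre-tokenized words into one string; the tokenizer then re-splits it
string_of_words = ' '.join(list_of_words)
token_classifier(string_of_words)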

Any help would be appreciated.
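
For context, one direction that might fit, offered as a sketch rather than a confirmed solution: fast tokenizers in Transformers accept a list of words when called with is_split_into_words=True, and the returned encoding exposes word_ids(), which maps each sub-token back to the index of the word it came from (None for special tokens). With that mapping, per-token predictions can be collapsed back to one label per original word. The helper below is hypothetical (the name labels_per_word and the "keep the first sub-token's label" rule are assumptions, not something the pipeline does for you), and the sample word_ids and labels are made up for illustration.

```python
from typing import List, Optional, Sequence


def labels_per_word(
    word_ids: Sequence[Optional[int]],
    token_labels: Sequence[str],
) -> List[str]:
    """Collapse per-token labels to one label per word.

    Keeps the label predicted for the first sub-token of each word and
    skips special tokens, whose word id is None.
    """
    if len(word_ids) != len(token_labels):
        raise ValueError("word_ids and token_labels must have the same length")

    word_labels: List[str] = []
    seen = set()
    for word_id, label in zip(word_ids, token_labels):
        # Skip special tokens and any sub-token after the first of a word
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        word_labels.append(label)
    return word_labels


# Made-up example: three words, the second one split into two sub-tokens.
# In practice word_ids would come from something like
# tokenizer(list_of_words, is_split_into_words=True).word_ids()
example_word_ids = [None, 0, 1, 1, 2, None]
example_token_labels = ["O", "B-PER", "I-PER", "I-PER", "O", "O"]
print(labels_per_word(example_word_ids, example_token_labels))
```

This bypasses pipeline() and assumes the model is called directly on the encoding, so the per-token labels would have to be derived from the model's logits separately. Words dropped by truncation would simply be missing from the result.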
