How to use pipeline for 'token-classification' with already tokenized input?

I have fine-tuned a BERT-based model for ‘token-classification’ and would like to run inference with it. From what I gathered, pipeline() would be a good fit here.

However, it requires not only the model but also a tokenizer. My data is already tokenized into a list of words, and I would like to feed it to the model as is. Providing the model's own tokenizer, however, splits the words in unexpected ways. Something like a “don’t do anything” tokenizer would be required.

Here is the code I am using, where I join the entire list of words into one string and let the tokenizer do the splitting again:

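from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Named token_classifier so it does not shadow the imported pipeline() function
token_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
string_of_words = ' '.join(list_of_words)
token_classifier(string_of_words)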

Any help would be appreciated.
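
One direction that might work is to skip pipeline() and call the tokenizer directly with is_split_into_words=True, then map the sub-token predictions back to the original words through word_ids(). The following is only a minimal sketch under a few assumptions: the tokenizer is a "fast" tokenizer (word_ids() is not available otherwise), the label of the first sub-token is taken as the label of the whole word, and the helper names labels_per_word and classify_words are made up for illustration. Note that the words are still split into sub-tokens internally; what this preserves is the word boundaries, so the output has one label per input word.

```python
from typing import List, Optional, Sequence


def labels_per_word(word_ids: Sequence[Optional[int]],
                    token_labels: Sequence[str]) -> List[str]:
    """Keep the label of the first sub-token of each word.

    word_ids has one entry per sub-token: the index of the source word,
    or None for special tokens such as [CLS] and [SEP].
    """
    labels: List[str] = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        # A new word starts whenever the word index changes.
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels


def classify_words(list_of_words: List[str], model_path: str) -> List[str]:
    """Hypothetical helper: return one predicted label per input word."""
    # Imported lazily so labels_per_word can be used without these packages.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    model = AutoModelForTokenClassification.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)  # assumed to be a fast tokenizer

    # is_split_into_words=True tells the tokenizer the input is already a list of words.
    encoding = tokenizer(list_of_words, is_split_into_words=True,
                         return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, num_sub_tokens, num_labels)

    predicted_ids = logits.argmax(dim=-1)[0].tolist()
    token_labels = [model.config.id2label[i] for i in predicted_ids]
    return labels_per_word(encoding.word_ids(batch_index=0), token_labels)
```

Calling classify_words(list_of_words, model_path) should then give one label per word, as long as the input fits within the model's maximum length; with truncation=True, words past that limit are dropped and the result is shorter than the input.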
