Is_split_into_words and using a pipeline

zhcr · September 3, 2021, 10:54am

I trained a DistilBERT model on text that was already split into sentences and words, so the tokenizer was called with is_split_into_words=True.

When using a pipeline, do I have to pass the input already split into words, as I did during training? Is this required, or can I pass text that has only been split into sentences?

swtb · May 9, 2024, 1:21pm

A bit late, but I recently had this question and couldn’t find a ready answer online. So making this reply for future folks.

The pipeline contains a “_sanitize_parameters” method which creates its own tokenizer parameters from scratch. So it does not seem possible at this point to pass the “is_split_into_words” parameter to the pipeline: even if you could, it would be overwritten in that method.
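
If the pipeline cannot take “is_split_into_words”, one workaround is to skip the pipeline and call the tokenizer and model directly. Below is a minimal sketch, not the pipeline's own logic. It assumes the model is a token-classification model (the original question does not say), that it ships with a fast tokenizer, and that you want one label per input word taken from the first sub-token. The names labels_per_word, predict_pre_split and model_dir are made up for illustration.

```python
from typing import List, Optional, Sequence


def labels_per_word(
    word_ids: Sequence[Optional[int]], token_labels: Sequence[str]
) -> List[str]:
    """Keep the label of the first sub-token of each word.

    word_ids is the list a fast tokenizer returns from BatchEncoding.word_ids():
    None for special tokens, otherwise the index of the source word.
    """
    result: List[str] = []
    seen = set()
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None or word_id in seen:
            continue  # skip special tokens and non-first sub-tokens
        seen.add(word_id)
        result.append(label)
    return result


def predict_pre_split(words: List[str], model_dir: str) -> List[str]:
    """Run a token-classification model on pre-split words without pipeline().

    Assumption: model_dir points to a fine-tuned token-classification
    checkpoint with a fast tokenizer (hypothetical path, not from the thread).
    """
    # Imported here so labels_per_word stays usable without these packages.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    model.eval()

    # Same tokenizer call as during training: the input is a list of words.
    encoding = tokenizer(
        words, is_split_into_words=True, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, seq_len, num_labels)

    predicted_ids = logits.argmax(dim=-1)[0].tolist()
    token_labels = [model.config.id2label[i] for i in predicted_ids]
    return labels_per_word(encoding.word_ids(batch_index=0), token_labels)
```

You would call it as predict_pre_split(["My", "name", "is", "Sarah"], "path/to/your/model"). Note that words cut off by truncation get no label, so the result can be shorter than the input.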

nielsr · May 9, 2024, 8:43pm

Yes, this is kind of an edge case. Technically, the is_split_into_words keyword argument could be added, similar to other postprocessing parameters. Feel free to open an issue on the Transformers repository so that the team can discuss this.
