Is_split_into_words and using a pipeline

I trained a DistilBERT model on text that was already split into sentences and words, so the tokenizer was called with the is_split_into_words parameter set to True.

When using a pipeline, do I have to pass the input already split into words, as I did during training? Is this required, or can I send text that has only been split into sentences?


A bit late, but I recently had this question and couldn’t find a ready answer online, so I’m posting this reply for future readers.

The pipeline contains a “_sanitize_parameters” method, which builds its own tokenizer parameters from scratch. So it does not seem possible at this point to pass the “is_split_into_words” parameter to the pipeline: even if you could, it would be overwritten in “_sanitize_parameters”.
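One possible workaround is to skip the pipeline and call the tokenizer and model directly. The sketch below assumes a token-classification model and a fast tokenizer (needed for word_ids()); the helper names labels_per_word and predict_pre_split are hypothetical, and labelling each word by its first sub-token is just one common convention, not necessarily what your training setup used.

```python
def labels_per_word(word_ids, token_label_ids, id2label):
    """Collapse per-token predictions to one label per input word.

    word_ids: output of encoding.word_ids() (None marks special tokens).
    token_label_ids: predicted label id for each token, same length.
    Assumption: each word takes the label of its first sub-token.
    """
    labels = []
    seen = set()
    for word_idx, label_id in zip(word_ids, token_label_ids):
        if word_idx is None or word_idx in seen:
            continue  # skip special tokens and non-first sub-tokens
        seen.add(word_idx)
        labels.append(id2label[label_id])
    return labels


def predict_pre_split(words, tokenizer, model):
    """Run a token-classification model on one pre-split sentence.

    tokenizer and model are assumed to be a fast tokenizer and an
    AutoModelForTokenClassification loaded from your own checkpoint.
    """
    enc = tokenizer(
        words,
        is_split_into_words=True,  # same setting as during training
        truncation=True,
        return_tensors="pt",
    )
    logits = model(**enc).logits  # shape: (1, seq_len, num_labels)
    token_label_ids = logits.argmax(dim=-1)[0].tolist()
    word_ids = enc.word_ids(batch_index=0)
    return labels_per_word(word_ids, token_label_ids, model.config.id2label)
```

Usage would look like predict_pre_split(["My", "name", "is", "Sarah"], tokenizer, model), which returns one label per word, up to any truncation.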

Yes, this is something of an edge case. Technically, the is_split_into_words keyword argument could be added, similar to the other postprocessing parameters. Feel free to open an issue on the Transformers repository so that the team can discuss this.
