How should I handle pre/post-processing with slow tokenizers for tasks like NER and question answering?

wgpubs · January 3, 2022, 6:51pm

How should folks using slow tokenizers handle pre/post-processing for tasks like question answering and token classification? Both, at least from the course, appear heavily dependent on the fast-tokenizer-only methods word_ids() and sequence_ids().

Also, I’m curious to know why the slow tokenizers don’t have word_ids and sequence_ids methods … and if there is a way we can get at, or build, the equivalent of them for slow tokenizers?

Thanks much!

sgugger · January 10, 2022, 3:53pm

There is no easy model-agnostic way to tackle those tasks with slow tokenizers, so you should really use a fast one for those tasks.
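
For anyone stuck with a slow tokenizer, one common workaround for token classification is to tokenize each pre-split word separately and record which word every sub-token came from, which gives a rough stand-in for word_ids(). The sketch below is only an illustration under assumptions: `toy_wordpiece`, `tokenize_with_word_ids` and `align_labels` are hypothetical helper names, and the toy splitter stands in for a real slow tokenizer's `tokenize` method. It does not add special tokens, and word-by-word tokenization may not match full-sentence tokenization for every tokenizer (for example ones that treat a leading space as part of the token), which is probably part of why this is not model-agnostic.

```python
from typing import Callable, List, Tuple


def toy_wordpiece(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a slow tokenizer's tokenize():
    cuts a word into fixed-size chunks, WordPiece style."""
    if not word:
        return []
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def tokenize_with_word_ids(
    words: List[str],
    tokenize_word: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Tokenize each word on its own and remember its index,
    approximating what word_ids() returns (without special tokens)."""
    tokens: List[str] = []
    word_ids: List[int] = []
    for idx, word in enumerate(words):
        pieces = tokenize_word(word)
        tokens.extend(pieces)
        word_ids.extend([idx] * len(pieces))
    return tokens, word_ids


def align_labels(
    word_labels: List[int],
    word_ids: List[int],
    ignore_index: int = -100,
) -> List[int]:
    """Label only the first sub-token of each word; mask the rest
    with ignore_index so the loss skips them."""
    labels: List[int] = []
    previous = None
    for wid in word_ids:
        labels.append(word_labels[wid] if wid != previous else ignore_index)
        previous = wid
    return labels


words = ["Hugging", "Face", "rocks"]
tokens, word_ids = tokenize_with_word_ids(words, toy_wordpiece)
labels = align_labels([1, 2, 0], word_ids)
```

With a real slow tokenizer you would presumably pass its `tokenize` method in place of `toy_wordpiece` and then check, for your specific model, that the per-word pieces agree with what the tokenizer produces on the full sentence. Question answering is harder, since it needs character offsets rather than word indices.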
