Enhanced word_ids() API for Chinese or CJK languages?

Is there an API like tokenizer.word_ids() that maps/aligns sub-words to whole words for CJK languages? word_ids() works well for whitespace-tokenizable languages like Farsi and Russian, but I am having difficulty mapping Chinese sub-words back to whole words in order to get whole-word vocabulary embeddings.
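
The only workaround I can think of is to pre-segment the text with an external word segmenter and then pass the word list to the tokenizer with is_split_into_words=True, so that word_ids() maps sub-words back to those segments instead of to individual characters. A minimal sketch of what I mean, using jieba for segmentation and bert-base-chinese as the model (both are just example choices, not requirements):

```python
import jieba
from transformers import AutoTokenizer

# Any fast tokenizer for a Chinese model would do; bert-base-chinese is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

text = "我喜欢自然语言处理"
words = jieba.lcut(text)  # pre-segmented words, e.g. ['我', '喜欢', '自然语言', '处理']

# is_split_into_words=True tells the tokenizer the input is already split into words,
# so word_ids() maps each sub-word token to the index of its jieba word.
encoding = tokenizer(words, is_split_into_words=True)
print(encoding.word_ids())  # e.g. [None, 0, 1, 1, 2, 2, 2, 2, 3, 3, None] (None = special tokens)
```

But this depends on an external segmenter rather than the tokenizer itself, so I would like to know whether there is (or could be) a built-in API for this kind of whole-word alignment in CJK languages.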