Finding the language specific tokens from XLMR

FaisalHejary · November 7, 2022, 8:55am

Hello everyone,

I’m trying to build a way to improve zero-shot classification. I need to generate words that are similar to the label word, using the embeddings. However, running what I have in mind over the entire XLM-R vocabulary seems expensive when I’m only using two languages (English and Arabic). So I was wondering: is there a way to get the indexes of only the English and Arabic tokens?

For example, something like: all Arabic words and word parts are from id 100 to id 1000.
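As far as I know, the XLM-R vocabulary is a single SentencePiece vocabulary shared across all languages, so I would not expect a clean contiguous id range per language. One rough workaround, sketched here purely as an assumption, is to group token ids by Unicode script instead. The helper names `token_script` and `ids_by_script` are hypothetical, and script is only a proxy for language: the "latin" group would also contain French, German, etc., and the "arabic" group would also contain Persian and Urdu pieces.

```python
import unicodedata


def token_script(token):
    """Classify one token as 'arabic', 'latin', 'other', or 'none' (no letters)."""
    scripts = set()
    for ch in token:
        # Skip digits, punctuation, and the SentencePiece "▁" marker.
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            scripts.add("arabic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "none"
    # Tokens mixing several scripts are treated as 'other'.
    return scripts.pop() if len(scripts) == 1 else "other"


def ids_by_script(vocab):
    """Group token ids by script, given a {token: id} mapping."""
    groups = {"arabic": [], "latin": [], "other": [], "none": []}
    for token, idx in vocab.items():
        groups[token_script(token)].append(idx)
    for ids in groups.values():
        ids.sort()
    return groups


# Tiny hand-made vocabulary for illustration (not real XLM-R ids).
toy_vocab = {"▁hello": 10, "▁كتاب": 11, "ing": 12, "▁123": 13, "▁привет": 14}
groups = ids_by_script(toy_vocab)
print(groups["arabic"], groups["latin"])
```

With a Hugging Face tokenizer, the `{token: id}` mapping could come from `tokenizer.get_vocab()`, and the resulting id lists could then be used to restrict the similarity search to the matching rows of the embedding matrix.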
