I’m trying to build a way to improve zero-shot classification. I need to generate words that are similar to the label word, using the embeddings. However, running what I have in mind over the entire XLM-R vocabulary seems expensive, since I’m only using two languages (English and Arabic). So I was wondering: is there a way to get the indices of only the English and Arabic tokens?
For example, something like: all Arabic words and word pieces are from id 100 to id 1000.
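
In case the ids do not fall into neat contiguous ranges like that, here is a minimal sketch of the kind of filter I have in mind: group token ids by the Unicode script of their letters. The helper names `token_script` and `ids_by_script` are hypothetical (made up for illustration), and the vocabulary is assumed to be a plain `{token: id}` dict, such as the one the Hugging Face tokenizer's `get_vocab()` returns. Note that this separates scripts, not languages: "latin" would also catch French, German, etc., and "arabic" would also catch Persian or Urdu pieces.

```python
import unicodedata

SP_PREFIX = "\u2581"  # SentencePiece word-boundary marker ("▁")


def token_script(token):
    """Classify a token as 'arabic', 'latin', or 'other' by its letters."""
    text = token.replace(SP_PREFIX, "")
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "other"  # digits, punctuation, bare boundary marker, ...
    names = [unicodedata.name(ch, "") for ch in letters]
    if all(name.startswith("ARABIC") for name in names):
        return "arabic"
    if all(name.startswith("LATIN") for name in names):
        return "latin"
    return "other"  # other scripts or mixed-script tokens


def ids_by_script(vocab):
    """Group token ids from a {token: id} dict by script, sorted ascending."""
    groups = {"arabic": [], "latin": [], "other": []}
    for token, idx in vocab.items():
        groups[token_script(token)].append(idx)
    for ids in groups.values():
        ids.sort()
    return groups


# Assumed usage with the Hugging Face tokenizer (not run here):
# from transformers import AutoTokenizer
# vocab = AutoTokenizer.from_pretrained("xlm-roberta-base").get_vocab()
# groups = ids_by_script(vocab)
```

The resulting id lists could then be used to index only the relevant rows of the embedding matrix, instead of searching over the whole vocabulary. Special tokens such as `<s>` contain Latin letters and would land in the "latin" group, so they may need to be filtered out separately.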