XLM-RoBERTa Tokenization

Hello everyone. How would you solve the following problem:
I have a word, a sentence containing that word, and the word's position in the sentence. How would you, universally across all languages, find the position of the tokenized word within the tokenized sentence?
I'm really struggling with this one, mainly with Chinese.

cc @Narsil, @anthony

Hi @Seva,

If you use tokenizers (the fast tokenizers within transformers), then you get offsets whenever you call encode (with transformers you currently need to call tokenizer._tokenizer.encode to reach the underlying tokenizers object).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')

encoded = tokenizer._tokenizer.encode("This is a test")
print(encoded.offsets)
# [(0, 0), (0, 4), (5, 7), (8, 9), (10, 14), (0, 0)]
# <s>,    "This",  "is",   "a",    "test",   </s>

So from that you should be able to recover your information.
The same offsets are also available through the fast tokenizer's own methods (for example by passing return_offsets_mapping=True when calling the tokenizer).
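To make the recovery step concrete, here is a minimal sketch of how you could go from the word's character span to its token indices using offsets like the ones above. The helper name find_token_span is my own, not part of any library; it just keeps tokens whose character range overlaps the word's range, skipping special tokens whose offsets are (0, 0).

```python
def find_token_span(offsets, char_start, char_end):
    # offsets: one (start, end) character pair per token, as returned
    # by a fast tokenizer; special tokens like <s> and </s> get (0, 0).
    # Returns the indices of tokens overlapping [char_start, char_end).
    return [
        i for i, (s, e) in enumerate(offsets)
        if e > s and s < char_end and e > char_start
    ]

# Offsets for "This is a test" from the example above:
offsets = [(0, 0), (0, 4), (5, 7), (8, 9), (10, 14), (0, 0)]
print(find_token_span(offsets, 10, 14))  # -> [4], the token for "test"
```

Because this works purely on character offsets, it does not depend on whitespace, so it should behave the same for Chinese text, where a single "word" may span several tokens.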

You can have a look at the docs for BatchEncoding in transformers, which is the type returned by the tokenizer methods (encode_plus, batch_encode_plus, __call__).

There are a bunch of helpers on it (such as char_to_token and word_ids) that make it easy to map between characters, words, and tokens.
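As an illustration of what those word-level helpers give you, here is a self-contained sketch that mimics what word_ids() returns for a whitespace-tokenized sentence, built purely from the offsets shown earlier. The function name token_word_ids is hypothetical; a real fast tokenizer computes this for you, and this whitespace-based version is only an analogue (real pre-tokenization can differ, especially for languages without spaces).

```python
def token_word_ids(sentence, offsets):
    # Assign each character a word index (words = whitespace-separated runs),
    # then map each token to the word containing its first character.
    # Special tokens with empty (s, s) offsets map to None.
    char_word = [None] * len(sentence)
    word_idx = -1
    in_word = False
    for i, ch in enumerate(sentence):
        if ch.isspace():
            in_word = False
        else:
            if not in_word:
                word_idx += 1
                in_word = True
            char_word[i] = word_idx

    return [char_word[s] if e > s else None for (s, e) in offsets]

sentence = "This is a test"
offsets = [(0, 0), (0, 4), (5, 7), (8, 9), (10, 14), (0, 0)]
print(token_word_ids(sentence, offsets))  # -> [None, 0, 1, 2, 3, None]
```

With a mapping like this in hand, "which tokens belong to word k" is just a matter of collecting the indices whose entry equals k.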