Xlm-Roberta Tokenizing

Seva · January 18, 2021, 5:43pm

Hello everyone. How would you solve the following problem:
I have a word, a sentence containing that word, and the word's position in the sentence. How would you find, in a way that works for all languages, the position of the tokenized word within the tokenized sentence?
I am really struggling with this one, mainly for Chinese.

valhalla · January 19, 2021, 5:59am

cc @Narsil, @anthony

Narsil · January 19, 2021, 8:52am

Hi @Seva,

If you use tokenizers (the library behind the fast tokenizers in transformers), you get offsets whenever you call encode. From transformers, you need to call tokenizer._tokenizer.encode to reach the underlying tokenizers object.

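from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')

# Call the underlying tokenizers object to get an Encoding with offsets
encoded = tokenizer._tokenizer.encode("This is a test")
encoded.offsets  # one (start, end) character span per token
# [(0, 0), (0, 4), (5, 7), (8, 9), (10, 14), (0, 0)]
# <s>,  "This", "is", "a", "test", </s>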

So from that you should be able to recover your information.
This information is available through Fast Tokenizer method calls too.
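
As a rough sketch of how the offsets can answer the original question: given the character span of the word in the sentence, collect the tokens whose offsets overlap that span. The helper name tokens_for_char_span is made up for illustration, and the offsets are the ones printed above. It assumes special tokens show up with an empty span such as (0, 0), as they do in that output.

```python
def tokens_for_char_span(offsets, start, end):
    """Return indices of tokens whose character range overlaps [start, end)."""
    return [
        i
        for i, (tok_start, tok_end) in enumerate(offsets)
        # Assumption: special tokens have an empty span, so skip them.
        if tok_start < tok_end
        # Standard half-open interval overlap test.
        and tok_start < end
        and tok_end > start
    ]


# Offsets copied from the roberta-large example above.
offsets = [(0, 0), (0, 4), (5, 7), (8, 9), (10, 14), (0, 0)]
sentence = "This is a test"

# The word's character position is known up front in the original question;
# here it is looked up only to keep the example short.
start = sentence.index("test")
end = start + len("test")

print(tokens_for_char_span(offsets, start, end))  # [4]
```

Because this works purely on character positions, it does not depend on whitespace word boundaries, which is what should make it usable for Chinese. A word can map to several tokens, and a single token can also cover more than the word, so the result is a list of token indices rather than one position.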

anthony · January 19, 2021, 5:24pm

You can have a look at the docs for BatchEncoding in transformers, which is the type returned by the tokenizer methods (encode_plus, batch_encode_plus, __call__).

There are a bunch of helpers available to easily map between characters, words, and tokens.
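
For instance, BatchEncoding.char_to_token maps a character index to the token that contains it (fast tokenizers only). Below is a minimal sketch built on it; token_span_for_word and demo are hypothetical names, and xlm-roberta-base is only assumed here because of the thread title.

```python
def token_span_for_word(encoding, char_start, char_end):
    """Return (first_token, last_token), inclusive, for chars [char_start, char_end).

    `encoding` is anything exposing char_to_token(char_index), such as the
    BatchEncoding returned by a fast tokenizer for a single sentence.
    Returns None if no character in the span maps to a token.
    """
    token_indices = [
        encoding.char_to_token(i) for i in range(char_start, char_end)
    ]
    # char_to_token returns None for characters that belong to no token
    # (whitespace, for example), so drop those.
    token_indices = [t for t in token_indices if t is not None]
    if not token_indices:
        return None
    return min(token_indices), max(token_indices)


def demo():
    # Imported lazily so the helper above has no hard dependency.
    # Running this downloads the (assumed) xlm-roberta-base tokenizer.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    sentence = "This is a test"
    start = sentence.index("test")
    encoding = tokenizer(sentence)
    return token_span_for_word(encoding, start, start + len("test"))
```

Looking up every character in the span, rather than only the first one, is meant to cover the case where the word is split into several subword tokens. Related helpers such as token_to_chars and word_ids go in the other direction or group tokens by word.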
