Hello everyone. How would you solve the following problem:
I have a word, a sentence containing that word, and the word's position in the sentence. How would you find the position of the tokenized word in the tokenized sentence, in a way that works universally across languages?
I'm really struggling with this one, mainly with Chinese.
Hi @Seva,
If you use `tokenizers` (which are the FastTokenizers within `transformers`), then you get offsets whenever you call `encode` (if you use `transformers`, you need to call `tokenizer._tokenizer.encode` to reach the underlying `tokenizers` object).
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')
encoded = tokenizer._tokenizer.encode("This is a test")
encoded.offsets
# [(0, 0), (0, 4), (5, 7), (8, 9), (10, 14), (0, 0)]
# <s>, "This", "is", "a", "test", </s>
```
So from those offsets you should be able to recover the position you need.
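For instance, if you know the word's character span in the sentence, you can scan the offsets for the tokens that overlap it. Here is a minimal sketch reusing the example above; the overlap test is my own addition, not something from the library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')

sentence = "This is a test"
word_start = sentence.index("test")   # character start of the word
word_end = word_start + len("test")   # character end (exclusive)

encoded = tokenizer._tokenizer.encode(sentence)

# Collect indices of all tokens whose character span overlaps the word.
# Tokens with an empty (0, 0) span are special tokens like <s>, so skip them.
token_indices = [
    i for i, (start, end) in enumerate(encoded.offsets)
    if start < word_end and end > word_start and start != end
]
print(token_indices)  # [4] -> the token "test"
```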
This information is also available through Fast Tokenizer method calls. You can have a look at the docs for `BatchEncoding` in `transformers`, which is the type returned by the tokenizer methods (`encode_plus`, `batch_encode_plus`, `__call__`).
There are a bunch of helpers available to easily map between characters, words, and tokens.
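For example, `BatchEncoding.char_to_token` maps a character index directly to the token that covers it, which also works for Chinese since it operates on character positions. A minimal sketch, assuming a multilingual checkpoint and an example sentence of my own choosing:

```python
from transformers import AutoTokenizer

# Sketch: map a word's character span to its token span using a
# BatchEncoding helper (fast tokenizers only). The checkpoint and the
# sentence are assumptions, not from the original post.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sentence = "我喜欢自然语言处理"   # "I like natural language processing"
word = "语言"                    # "language"
char_start = sentence.index(word)
char_end = char_start + len(word) - 1  # index of the word's last character

encoding = tokenizer(sentence)

# char_to_token returns the index of the token covering a character
token_start = encoding.char_to_token(char_start)
token_end = encoding.char_to_token(char_end)
print(token_start, token_end)  # token span covering the word
```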