How does a tokenizer (e.g., AutoTokenizer) generate the word_ids integers?

Context of the question (scroll down to get to the real question):

I need to find the start_position and end_position of the sequence that I input to a QA model (e.g., RoBERTa, LLMv3) from the start and end character positions of the answer in a context. @NielsRogge, in his notebook (Transformers-Tutorials/LayoutLMv2/DocVQA/Fine_tuning_LayoutLMv2ForQuestionAnswering_on_DocVQA.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub), does this by using the word_ids of the tokenized context together with the start and end token indices of the answer in that context.
e.g.,
context_token_indices = [0, 1, 2, 3, 4, 5, 6, 7, || 8, 9, 10, 11, 12, 13 ||, 14, 15, 16, 17, 18] # pipes mark the answer span in the word-level tokenized context
ans_start_token_idx_in_context = 8
ans_end_token_idx_in_context = 13
context_word_ids = [0, 1, 2, 3, 3, 4, 5, 6, 7, || 8, 8, 9, 10, 10, 10, 11, 12, 13 ||, 14, 15, 16, 17, 18, 18] # pipes mark the answer span in the subword-tokenized context
(FYI: repeated indices are created when a word like "bookworm" is broken down into "book" and "worm" by the tokenizer.)
we infer that…
start_position = 9
end_position = 17
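
For illustration, here is a minimal sketch of that mapping with a fast tokenizer (the checkpoint, the word list, and the answer span are made up by me, not taken from the notebook):

```python
from transformers import AutoTokenizer

# roberta-base is just an example checkpoint; add_prefix_space=True is needed
# when feeding pre-split words to RoBERTa's byte-level BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

# Word-level context and answer span expressed as word indices (made up).
words = ["My", "favourite", "bookworm", "keeps", "reading", "very", "long",
         "novels", "every", "single", "evening", "after", "dinner", "quietly"]
ans_start_word_idx = 8   # answer = words[8:14]
ans_end_word_idx = 13

encoding = tokenizer(words, is_split_into_words=True, add_special_tokens=False)
word_ids = encoding.word_ids()   # one word index per subword token

# start_position: first token whose word_id equals the answer's first word index
# end_position:   last token whose word_id equals the answer's last word index
start_position = word_ids.index(ans_start_word_idx)
end_position = len(word_ids) - 1 - word_ids[::-1].index(ans_end_word_idx)
print(word_ids)
print(start_position, end_position)
```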

Therefore, the success of the whole process depends on how we tokenize the context.

@NielsRogge used match, word_idx_start, word_idx_end = subfinder(words, answer.split()) (it takes the first match and doesn't work for me), which is essentially the same as doing context.split(), but that fails to split at ",", ".", etc. the way AutoTokenizer does.
e.g., "Great! Keep it up." must be tokenized as ["Great", "!", "Keep", "it", "up", "."]
=> context_token_indices = [0, 1, 2, 3, 4, 5], not [0, 1, 2, 3]
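
For what it's worth, the fast tokenizers expose their own pre-tokenization step, which, if I understand correctly, is what the word_ids are based on when you pass raw text. A quick way to inspect it (roberta-base is only an example checkpoint; the printed output is approximate):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

text = "Great! Keep it up."

# Plain Python split leaves punctuation attached to the words:
print(text.split())
# ['Great!', 'Keep', 'it', 'up.']

# The fast tokenizer's own pre-tokenization step (which runs before the
# subword model) splits the punctuation off:
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))
# [('Great', (0, 5)), ('!', (5, 6)), ('ĠKeep', (6, 11)), ('Ġit', (11, 14)),
#  ('Ġup', (14, 17)), ('.', (17, 18))]
```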
So I tried using nltk.word_tokenize. Though it does better, it still fails to split at "-".
e.g., "Great! Keep-it-up." must be tokenized as ["Great", "!", "Keep", "-", "it", "-", "up", "."]
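
To make that concrete, a quick check (needs nltk's punkt data; output shown as I understand it):

```python
import nltk

nltk.download("punkt", quiet=True)  # newer nltk versions may need "punkt_tab" instead

print(nltk.word_tokenize("Great! Keep-it-up."))
# ['Great', '!', 'Keep-it-up', '.']  -- the hyphenated chunk stays as a single token
```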
(Using offset_mapping instead of word_ids gives some other problems; RoBERTa and LLMv3 offsets look very different.)
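
In case it helps, this is roughly how I understand the offset_mapping route (a sketch, not the notebook's code; the context and answer strings are made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # example checkpoint

context = "Great! Keep it up."
answer = "Keep it up"
answer_start_char = context.index(answer)           # 7
answer_end_char = answer_start_char + len(answer)   # 17

enc = tokenizer(context, return_offsets_mapping=True)
offsets = enc["offset_mapping"]  # one (char_start, char_end) per token; (0, 0) for special tokens

# start_position: first token whose character span contains the answer's first character
# end_position:   last token whose character span contains the answer's last character
start_position = next(i for i, (s, e) in enumerate(offsets) if s <= answer_start_char < e)
end_position = max(i for i, (s, e) in enumerate(offsets) if s < answer_end_char <= e)
print(start_position, end_position)
```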
I wonder how many other such edge cases exist. Thus the important point to be clarified is…

Real question:
"How do we need to tokenize the context to find start_position and end_position for the question answering task?" This can be understood if we know "how Hugging Face tokenizers generate the integers in word_ids".
Can anyone please answer one of the questions in the quotes?