How does a tokenizer (e.g., AutoTokenizer) generate the word_ids integers?

Context of the question (scroll down to get to the real question):

I need to find the start_position and end_position of the sequence that I input to a QA model (e.g., RoBERTa, LLMv3) from the start and end character positions of the answer in a context. @NielsRogge, in his notebook (Transformers-Tutorials/LayoutLMv2/DocVQA/Fine_tuning_LayoutLMv2ForQuestionAnswering_on_DocVQA.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub), does this by using the word_ids of the tokenized context together with the start and end token indices of the answer in that context.
e.g.,
context_token_indices = [0, 1, 2, 3, 4, 5, 6, 7, || 8, 9, 10, 11, 12, 13 ||, 14, 15, 16, 17, 18] # pipes mark the answer span in the word-level tokenized context
ans_start_token_idx_in_context = 8
ans_end_token_idx_in_context = 13
context_word_ids = [0, 1, 2, 3, 3, 4, 5, 6, 7, || 8, 8, 9, 10, 10, 10, 11, 12, 13 ||, 14, 15, 16, 17, 18, 18] # pipes mark the answer span in the subword-tokenized context
(FYI: repeated indices are created when a word like "bookworm" is broken down into "book" and "worm" by the tokenizer.)
we infer that…
start_position = 9
end_position = 17
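
For illustration, here is a minimal sketch of that mapping with a fast tokenizer (the checkpoint, the word list, and the answer span are made up by me, not taken from the notebook):

```python
from transformers import AutoTokenizer

# roberta-base is just an example checkpoint; add_prefix_space=True is needed
# when feeding pre-split words to RoBERTa's byte-level BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

# Word-level context and answer span expressed as word indices (made up).
words = ["My", "favourite", "bookworm", "keeps", "reading", "very", "long",
         "novels", "every", "single", "evening", "after", "dinner", "quietly"]
ans_start_word_idx = 8   # answer = words[8:14]
ans_end_word_idx = 13

encoding = tokenizer(words, is_split_into_words=True, add_special_tokens=False)
word_ids = encoding.word_ids()   # one word index per subword token

# start_position: first token whose word_id equals the answer's first word index
# end_position:   last token whose word_id equals the answer's last word index
start_position = word_ids.index(ans_start_word_idx)
end_position = len(word_ids) - 1 - word_ids[::-1].index(ans_end_word_idx)
print(word_ids)
print(start_position, end_position)
```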

Therefore, the success of the whole process depends on how we tokenize the context.

@NielsRogge used match, word_idx_start, word_idx_end = subfinder(words, answer.split()) (it takes the first match and doesn't work for me), which is essentially the same as doing context.split(), but that fails to split at ",", ".", etc. the way AutoTokenizer does.
e.g., "Great! Keep it up." must be tokenized as ["Great", "!", "Keep", "it", "up", "."]
=> context_token_indices = [0, 1, 2, 3, 4, 5], not [0, 1, 2, 3]
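
For what it's worth, the fast tokenizers expose their own pre-tokenization step, which, if I understand correctly, is what the word_ids are based on when you pass raw text. A quick way to inspect it (roberta-base is only an example checkpoint; the printed output is approximate):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

text = "Great! Keep it up."

# Plain Python split leaves punctuation attached to the words:
print(text.split())
# ['Great!', 'Keep', 'it', 'up.']

# The fast tokenizer's own pre-tokenization step (which runs before the
# subword model) splits the punctuation off:
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))
# [('Great', (0, 5)), ('!', (5, 6)), ('ĠKeep', (6, 11)), ('Ġit', (11, 14)),
#  ('Ġup', (14, 17)), ('.', (17, 18))]
```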
So I tried using nltk.word_tokenize. Though it does better, it still fails to split at "-".
e.g., "Great! Keep-it-up." must be tokenized as ["Great", "!", "Keep", "-", "it", "-", "up", "."]
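
To make that concrete, a quick check (needs nltk's punkt data; output shown as I understand it):

```python
import nltk

nltk.download("punkt", quiet=True)  # newer nltk versions may need "punkt_tab" instead

print(nltk.word_tokenize("Great! Keep-it-up."))
# ['Great', '!', 'Keep-it-up', '.']  -- the hyphenated chunk stays as a single token
```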
(Using offset_mapping instead of word_ids gives some other problems; RoBERTa and LLMv3 offsets look very different.)
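
In case it helps, this is roughly how I understand the offset_mapping route (a sketch, not the notebook's code; the context and answer strings are made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # example checkpoint

context = "Great! Keep it up."
answer = "Keep it up"
answer_start_char = context.index(answer)           # 7
answer_end_char = answer_start_char + len(answer)   # 17

enc = tokenizer(context, return_offsets_mapping=True)
offsets = enc["offset_mapping"]  # one (char_start, char_end) per token; (0, 0) for special tokens

# start_position: first token whose character span contains the answer's first character
# end_position:   last token whose character span contains the answer's last character
start_position = next(i for i, (s, e) in enumerate(offsets) if s <= answer_start_char < e)
end_position = max(i for i, (s, e) in enumerate(offsets) if s < answer_end_char <= e)
print(start_position, end_position)
```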
I wonder how many other such edge cases exist. Thus the important point to be clarified is…

Real question:
"How do we need to tokenize the context to find start_position and end_position for the question answering task?" This can be understood if we know "how Hugging Face tokenizers generate the integers in word_ids".
Can anyone please answer one of the questions in the quotes?