Sorry to repeat this on the forum; I have already asked it on Stack Overflow. My intent is not to spam, but to get a response as quickly as possible, since this is critical for my project.
I am trying to add new tokens to the LayoutXLM tokenizer (`"microsoft/layoutxlm-base"`). Here is the code:
```python
from transformers import AutoTokenizer, LayoutLMv2ForRelationExtraction

# `tokens` and `bboxes` are defined earlier in my code
model = LayoutLMv2ForRelationExtraction.from_pretrained("microsoft/layoutxlm-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutxlm-base")

new_tokens = tokens
# check if the tokens are already in the vocabulary
new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())
# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(list(new_tokens))
# add new, random embeddings for the new tokens
model.resize_token_embeddings(len(tokenizer))

# save the modified model and tokenizer
model.save_pretrained('layout_xlm_base_model')
tokenizer.save_pretrained('layout_xlm_base_tokenizer')

# reload and encode
tokenizer = AutoTokenizer.from_pretrained('layout_xlm_base_tokenizer')
input_ids = tokenizer.encode(text=tokens, boxes=bboxes, is_pretokenized=False)
print(len(input_ids))
```
This prints 126 for the 124 given tokens, which is correct (presumably the 124 tokens plus the two special tokens the tokenizer adds).
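For what it's worth, the vocabulary-filtering step behaves as I expect when tested in isolation, with a toy dict standing in for the real vocabulary:

```python
# Toy stand-in for the tokenizer's vocabulary: token -> id
existing_vocab = {"hello": 0, "world": 1}
# Hypothetical candidate tokens, for illustration only
candidate_tokens = ["hello", "invoice_no", "world", "total_amt"]

# keep only the genuinely new tokens, as in the snippet above
new_tokens = sorted(set(candidate_tokens) - set(existing_vocab.keys()))
print(new_tokens)  # ['invoice_no', 'total_amt']
```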
But when I run the same logic from a function in a script, it doesn't work:
```python
from transformers import AutoTokenizer, LayoutLMv2ForRelationExtraction

def add_tokens(tokens: list = None,
               tokenizer: AutoTokenizer = None,
               model: LayoutLMv2ForRelationExtraction = None):
    new_tokens = tokens
    # check if the tokens are already in the vocabulary
    new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())
    # add the tokens to the tokenizer vocabulary
    tokenizer.add_tokens(list(new_tokens))
    # add new, random embeddings for the new tokens
    model.resize_token_embeddings(len(tokenizer))
    tokenizer.save_pretrained('layout_xlm_base_tokenizer')
    model.save_pretrained('layout_xlm_base_model')

# `tokens`, `bboxes`, `tokenizer`, and `model` are defined/loaded as before
add_tokens(tokens=tokens, tokenizer=tokenizer, model=model)

tokenizer = AutoTokenizer.from_pretrained('layout_xlm_base_tokenizer')
model = LayoutLMv2ForRelationExtraction.from_pretrained('layout_xlm_base_model')

input_ids = tokenizer.encode(text=tokens, boxes=bboxes)
print(len(input_ids))
```
This prints 248, the same length as before the modification, which is not what I intend.

What am I missing?
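To clarify what I mean by "before modification": while a token is missing from the vocabulary, the tokenizer splits it into several subword pieces, so the encoded sequence is longer (248); once the token is added whole, it maps to a single id (126). A toy sketch of that effect (not the real SentencePiece tokenizer, just an illustration of why the length drops):

```python
def toy_encode(words, vocab):
    """Toy encoder: known words get one id; unknown words are 'split'
    into one id per character, crudely mimicking subword splitting."""
    ids = []
    for w in words:
        if w in vocab:
            ids.append(vocab[w])
        else:
            ids.extend(vocab.setdefault(c, len(vocab)) for c in w)
    return ids

vocab = {"hello": 0}
words = ["hello", "foo"]

# "foo" is unknown, so it splits into pieces -> longer sequence
print(len(toy_encode(words, dict(vocab))))  # 4

# after adding "foo" as a whole token, it encodes to a single id
vocab_with_new = dict(vocab)
vocab_with_new["foo"] = len(vocab_with_new)
print(len(toy_encode(words, vocab_with_new)))  # 2
```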