Sorry to repeat this on the forum; I have already asked it on Stack Overflow. My intent is not to spam, but to get a response as quickly as possible, since this is critical for my project.
I am trying to add new tokens to the LayoutXLM tokenizer (`"microsoft/layoutxlm-base"`). Here is the code:
```python
from transformers import AutoTokenizer, LayoutLMv2ForRelationExtraction

# `tokens` and `bboxes` are defined earlier in my code
model = LayoutLMv2ForRelationExtraction.from_pretrained("microsoft/layoutxlm-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutxlm-base")

new_tokens = tokens
# check if the tokens are already in the vocabulary
new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())
# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(list(new_tokens))
# add new, random embeddings for the new tokens
model.resize_token_embeddings(len(tokenizer))

# save the modified model and tokenizer
model.save_pretrained('layout_xlm_base_model')
tokenizer.save_pretrained('layout_xlm_base_tokenizer')

# reload and encode
tokenizer = AutoTokenizer.from_pretrained('layout_xlm_base_tokenizer')
input_ids = tokenizer.encode(text=tokens, boxes=bboxes, is_pretokenized=False)
print(len(input_ids))
```
This prints 126 for the 124 given tokens, which is correct (presumably the 124 tokens plus the two special tokens the tokenizer adds).
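For what it's worth, the vocabulary-filtering step behaves as I expect when tested in isolation, with a toy dict standing in for the real vocabulary:

```python
# Toy stand-in for the tokenizer's vocabulary: token -> id
existing_vocab = {"hello": 0, "world": 1}
# Hypothetical candidate tokens, for illustration only
candidate_tokens = ["hello", "invoice_no", "world", "total_amt"]

# keep only the genuinely new tokens, as in the snippet above
new_tokens = sorted(set(candidate_tokens) - set(existing_vocab.keys()))
print(new_tokens)  # ['invoice_no', 'total_amt']
```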
But when I run the same logic from a function in a script, it doesn't work:
```python
from transformers import AutoTokenizer, LayoutLMv2ForRelationExtraction

def add_tokens(tokens: list = None,
               tokenizer: AutoTokenizer = None,
               model: LayoutLMv2ForRelationExtraction = None):
    new_tokens = tokens
    # check if the tokens are already in the vocabulary
    new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())
    # add the tokens to the tokenizer vocabulary
    tokenizer.add_tokens(list(new_tokens))
    # add new, random embeddings for the new tokens
    model.resize_token_embeddings(len(tokenizer))
    tokenizer.save_pretrained('layout_xlm_base_tokenizer')
    model.save_pretrained('layout_xlm_base_model')

# `tokens`, `bboxes`, `tokenizer`, and `model` are defined/loaded as before
add_tokens(tokens=tokens, tokenizer=tokenizer, model=model)

tokenizer = AutoTokenizer.from_pretrained('layout_xlm_base_tokenizer')
model = LayoutLMv2ForRelationExtraction.from_pretrained('layout_xlm_base_model')

input_ids = tokenizer.encode(text=tokens, boxes=bboxes)
print(len(input_ids))
```
This prints 248, the same length as before the modification, which is not what I intend.

What am I missing?
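To clarify what I mean by "before modification": while a token is missing from the vocabulary, the tokenizer splits it into several subword pieces, so the encoded sequence is longer (248); once the token is added whole, it maps to a single id (126). A toy sketch of that effect (not the real SentencePiece tokenizer, just an illustration of why the length drops):

```python
def toy_encode(words, vocab):
    """Toy encoder: known words get one id; unknown words are 'split'
    into one id per character, crudely mimicking subword splitting."""
    ids = []
    for w in words:
        if w in vocab:
            ids.append(vocab[w])
        else:
            ids.extend(vocab.setdefault(c, len(vocab)) for c in w)
    return ids

vocab = {"hello": 0}
words = ["hello", "foo"]

# "foo" is unknown, so it splits into pieces -> longer sequence
print(len(toy_encode(words, dict(vocab))))  # 4

# after adding "foo" as a whole token, it encodes to a single id
vocab_with_new = dict(vocab)
vocab_with_new["foo"] = len(vocab_with_new)
print(len(toy_encode(words, vocab_with_new)))  # 2
```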