Access tokenizer from within predict_fn

I’m trying to override predict_fn for a named-entity recognition task, mainly because I feed it very long sequences. I need to use the tokenizer to break a long sequence into sub-sequences and then merge the predictions of all the sub-sequences.

I call the tokenizer as:

sentences = tokenizer(sentence, max_length=max_length, stride=stride, truncation=True,
                      return_overflowing_tokens=True)

Since I use a stride, I need to handle the overlapping tokens and their predictions properly.
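Conceptually, what I want to do inside predict_fn is roughly this (a simplified sketch, assuming a fast tokenizer and a PyTorch token-classification model):

import torch

chunks = tokenizer(sentence, max_length=max_length, stride=stride, truncation=True,
                   return_overflowing_tokens=True)

chunk_predictions = []
for input_ids in chunks["input_ids"]:
    with torch.no_grad():
        logits = model(input_ids=torch.tensor([input_ids])).logits
    chunk_predictions.append(logits.argmax(dim=-1).squeeze(0).tolist())

# consecutive chunks overlap by `stride` tokens, so the overlapping positions
# need to be reconciled (e.g. keep the label from the earlier chunk) before
# stitching everything back into a single label sequence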

I can access the model, since it is passed to predict_fn as a parameter, but how do I access the tokenizer?

Thanks!

Hi Rogerio

model_fn actually takes the model directory as a parameter, not the model itself (see here: Deploy models to Amazon SageMaker). This means that you can load the tokenizer within model_fn like so:

tokenizer = AutoTokenizer.from_pretrained(model_dir)
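For example, a rough sketch (assuming a token-classification model; whatever model_fn returns is what predict_fn later receives as its model parameter):

from transformers import AutoModelForTokenClassification, AutoTokenizer

def model_fn(model_dir):
    # model_dir is the path to the extracted model.tar.gz artefacts
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return {"model": model, "tokenizer": tokenizer}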

Cheers
Heiko

Hi Heiko,

I apologise! I meant to say predict_fn and not model_fn. I’m correcting this in the original post.

Thanks!

Ah, I see. No worries - you can just save the tokenizer in a variable and you’ll be able to access it throughout your inference.py code. Check this one as an example: https://github.com/marshmellow77/hf-summarization/blob/main/inference_code/inference.py
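In that example the tokenizer is returned from model_fn together with the model, so predict_fn receives both. Simplified, the pattern looks something like this (not the exact code from the file; the "inputs" key is just an assumed request format):

import torch

def predict_fn(data, model_and_tokenizer):
    # model_and_tokenizer is whatever model_fn returned, e.g. a dict holding both objects
    model = model_and_tokenizer["model"]
    tokenizer = model_and_tokenizer["tokenizer"]
    encoded = tokenizer(data["inputs"], return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**encoded).logits
    return {"predictions": logits.argmax(dim=-1).tolist()}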

Hope that helps 🙂

This link is not working for me. Maybe it’s not a public repository?

Apologies - this is the correct link: text-summarisation-project/inference.py at main · marshmellow77/text-summarisation-project · GitHub

I see what you did there! Great idea to pass the tokenizer and model in a dictionary! I’ll implement and update here once I get it to work. Thank you so much once again!!

Worked like a charm! I can now run very long sequences through the token classifier. Thank you Heiko!
