Access tokenizer from within predict_fn

I’m trying to override predict_fn for a named-entity recognition task, mainly because I feed it very long sequences. In that case I need the tokenizer to break the long sequence into overlapping chunks and then merge the predictions of all the sub-sequences.

I call the tokenizer as:

sentences = tokenizer(sentence, max_length=max_length, stride=stride, truncation=True, return_overflowing_tokens=True)

Since I use a stride, I need to handle the overlapping tokens and their predictions properly.
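For illustration, here is a minimal sketch of one way to merge the per-chunk predictions; the function name and the keep-the-earlier-chunk policy are my assumptions, not something from this thread (averaging the logits over the overlap would be another option):

```python
def merge_chunk_predictions(chunk_preds, stride):
    """Merge per-token label predictions from overlapping chunks.

    chunk_preds: list of per-chunk label lists, where each chunk
    overlaps the previous one by `stride` tokens.
    For overlapping positions, keep the earlier chunk's prediction
    (an assumed policy for this sketch).
    """
    merged = list(chunk_preds[0])
    for preds in chunk_preds[1:]:
        # The first `stride` tokens repeat the tail of the previous chunk
        merged.extend(preds[stride:])
    return merged

# Example: two chunks overlapping by one token
print(merge_chunk_predictions([["O", "B-PER", "I-PER"], ["I-PER", "O", "O"]], stride=1))
# → ['O', 'B-PER', 'I-PER', 'O', 'O']
```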

I can access the model, as it is received as a parameter of the predict_fn, but how do I access the tokenizer?


Hi Rogerio

model_fn actually takes the model directory as a parameter, not the model itself (see here: Deploy models to Amazon SageMaker). This means that you can load the tokenizer within model_fn like so:

tokenizer = AutoTokenizer.from_pretrained(model_dir)


Hi Heiko,

I apologise! I meant to say predict_fn and not model_fn. I’m correcting this in the original post.


Ah, I see. No worries - you can just save the tokenizer in a variable and you’ll be able to access it throughout your code. Check this one as an example:

Hope that helps :slight_smile:

This link is not working for me. Maybe it’s not a public repository?

Apologies - this is the correct link: text-summarisation-project/ at main · marshmellow77/text-summarisation-project · GitHub

I see what you did there! Great idea to pass the tokenizer and model in a dictionary! I’ll implement and update here once I get it to work. Thank you so much once again!!

Worked like a charm! I can now run very long sequences through the token classifier. Thank you Heiko!