Access tokenizer from within predict_fn

rogeriobromfman · January 14, 2022, 3:08pm

I’m trying to override predict_fn for a named-entity recognition task, mostly because I provide very long sequences. In this case, I need to use the tokenizer to break the long sequence down and then merge the predictions of all the sub-sequences.

I call the tokenizer as:

sentences = tokenizer(sentence, max_length=max_length, stride=stride, truncation=True,
                      return_overflowing_tokens=True)

Since I have a stride, I need to handle the overlapping tokens and their predictions correctly.
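
For readers following along, here is a rough sketch of how windows with a stride overlap and how their per-token predictions could be merged. It works on plain Python lists and ignores the special tokens a real tokenizer adds. The function names (split_with_stride, merge_predictions) are hypothetical, and keeping the higher-scoring prediction for an overlapping token is only one possible policy, not something stated in this thread.

```python
from typing import List, Optional, Tuple

Prediction = Tuple[str, float]  # (label, score) for one token


def split_with_stride(tokens: List[str], max_length: int, stride: int):
    """Split tokens into windows of max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        # Remember where each window starts so predictions can be mapped back.
        windows.append((start, tokens[start:start + max_length]))
        if start + max_length >= len(tokens):
            break
        start += step
    return windows


def merge_predictions(windows, window_preds: List[List[Prediction]], total_len: int):
    """Merge per-window predictions into one prediction per original token.

    Assumed policy: when a token appears in two windows, keep the prediction
    with the higher score.
    """
    merged: List[Optional[Prediction]] = [None] * total_len
    for (start, _window), preds in zip(windows, window_preds):
        for offset, (label, score) in enumerate(preds):
            pos = start + offset
            if merged[pos] is None or score > merged[pos][1]:
                merged[pos] = (label, score)
    return merged
```

Other merge policies, such as averaging the logits of overlapping tokens or preferring the window in which a token has more context on both sides, would fit the same structure.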

I can access the model, as it is received as a parameter of the predict_fn, but how do I access the tokenizer?

Thanks!

marshmellow77 · January 14, 2022, 3:55pm

Hi Rogerio

model_fn actually takes the model directory as a parameter, not the model itself (see here: Deploy models to Amazon SageMaker). This means that you can load the tokenizer within model_fn like so:

tokenizer = AutoTokenizer.from_pretrained(model_dir)

Cheers
Heiko

rogeriobromfman · January 14, 2022, 3:59pm

Hi Heiko,

I apologise! I meant to say predict_fn and not model_fn. I’m correcting this in the original post.

Thanks!

marshmellow77 · January 14, 2022, 4:03pm

Ah, I see. No worries - you can just save the tokenizer in a variable and you’ll be able to access it throughout your inference.py code. Check this one as an example: https://github.com/marshmellow77/hf-summarization/blob/main/inference_code/inference.py

Hope that helps

rogeriobromfman · January 14, 2022, 4:10pm

This link is not working for me. Maybe it’s not a public repository?

marshmellow77 · January 14, 2022, 4:12pm

Apologies - this is the correct link: text-summarisation-project/inference.py at main · marshmellow77/text-summarisation-project · GitHub

rogeriobromfman · January 14, 2022, 4:14pm

I see what you did there! Great idea to pass the tokenizer and model in a dictionary! I’ll implement and update here once I get it to work. Thank you so much once again!!
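
For anyone landing here later, a minimal sketch of the dictionary approach might look like the following. It is not the code from the linked repository. It assumes a fast tokenizer (so that return_overflowing_tokens yields one row per window), a request payload of the form {"inputs": "..."}, and illustrative values for MAX_LENGTH and STRIDE. The windows still include special tokens and overlapping tokens, so merging them is left out here.

```python
MAX_LENGTH = 512  # assumed value, adjust to the model's limit
STRIDE = 128      # assumed value


def model_fn(model_dir):
    # Imported here so the module can be loaded even where transformers is absent.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    model.eval()
    model.requires_grad_(False)  # inference only, no gradient tracking needed
    # Whatever model_fn returns is handed to predict_fn as its second argument,
    # so a dict lets both objects travel together.
    return {"model": model, "tokenizer": tokenizer}


def predict_fn(data, model_and_tokenizer):
    model = model_and_tokenizer["model"]
    tokenizer = model_and_tokenizer["tokenizer"]

    # One padded row per overlapping window of the long input.
    encoded = tokenizer(
        data["inputs"],
        max_length=MAX_LENGTH,
        stride=STRIDE,
        truncation=True,
        padding=True,
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    outputs = model(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
    )
    label_ids = outputs.logits.argmax(dim=-1).tolist()
    masks = encoded["attention_mask"].tolist()

    # Drop padding positions; the result is one list of label ids per window.
    return [
        [label for label, keep in zip(row, mask) if keep]
        for row, mask in zip(label_ids, masks)
    ]
```

The label ids can be mapped to names with model.config.id2label if the model config defines it.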

rogeriobromfman · January 14, 2022, 7:53pm

Worked like a charm! I can now run very long sequences through the token classifier. Thank you Heiko!
