I’m trying to override predict_fn for a named-entity recognition task, mainly because I provide very long sequences. In this case, I need to use the tokenizer to break each long sequence down into chunks, and then merge the predictions of all the sub-sequences.
I call the tokenizer as:
sentences = tokenizer(sentence, max_length=max_length, stride=stride,
                      truncation=True, return_overflowing_tokens=True)
Since I use a stride, I need to handle the overlapping tokens and their predictions properly.
I can access the model, as it is received as a parameter of predict_fn, but how do I access the tokenizer?
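For context, the per-chunk part I have in mind looks roughly like this (just a sketch, assuming a fast tokenizer so that return_overflowing_tokens=True yields one list of input_ids per chunk; merge_overlaps is a placeholder for whatever overlap resolution I end up with):

import torch

chunk_logits = []
for input_ids in sentences["input_ids"]:
    # run each sub-sequence through the token classifier separately
    with torch.no_grad():
        output = model(torch.tensor([input_ids]))
    chunk_logits.append(output.logits[0])

# consecutive chunks share `stride` tokens, so their predictions
# overlap there and need to be reconciled
predictions = merge_overlaps(chunk_logits, stride)  # placeholder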
Thanks!
Hi Rogerio
model_fn actually takes the model directory as a parameter, not the model itself (see here: Deploy models to Amazon SageMaker). This means that you can load the tokenizer within model_fn like so:
tokenizer = AutoTokenizer.from_pretrained(model_dir)
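For example, a minimal model_fn could look like this (just a sketch, assuming a token-classification model):

from transformers import AutoModelForTokenClassification, AutoTokenizer

def model_fn(model_dir):
    # model_dir is the directory SageMaker unpacked the model artifacts into
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    # the tokenizer is loaded here; how you hand it to the other
    # handler functions is up to you
    return model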
Cheers
Heiko
Hi Heiko,
I apologise! I meant to say predict_fn and not model_fn. I’m correcting this in the original post.
Thanks!
Ah, I see. No worries - you can just save the tokenizer in a variable and you’ll be able to access it throughout your inference.py code. Check this one as an example: https://github.com/marshmellow77/hf-summarization/blob/main/inference_code/inference.py
Hope that helps
This link is not working for me. Maybe it’s not a public repository?
I see what you did there! Great idea to pass the tokenizer and model in a dictionary! I’ll implement it and update here once I get it to work. Thank you so much once again!!
Worked like a charm! I can now run very long sequences through the token classifier. Thank you Heiko!
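For anyone who finds this thread later, my inference.py now looks roughly like this (a sketch: the max_length and stride values are made up, data["inputs"] assumes the usual Hugging Face payload format, and merging the overlapping predictions is left out because it’s task-specific):

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

def model_fn(model_dir):
    # return both objects in a dictionary so predict_fn receives them together
    return {
        "model": AutoModelForTokenClassification.from_pretrained(model_dir),
        "tokenizer": AutoTokenizer.from_pretrained(model_dir),
    }

def predict_fn(data, model_and_tokenizer):
    model = model_and_tokenizer["model"]
    tokenizer = model_and_tokenizer["tokenizer"]
    sentences = tokenizer(data["inputs"], max_length=512, stride=128,
                          truncation=True, return_overflowing_tokens=True)
    predictions = []
    for input_ids in sentences["input_ids"]:
        with torch.no_grad():
            logits = model(torch.tensor([input_ids])).logits[0]
        predictions.append(logits.argmax(dim=-1).tolist())
    # the chunks overlap by `stride` tokens; reconciling those
    # overlapping predictions is task-specific and omitted here
    return predictions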