Is there a way to access the tokenizer used in the HuggingFacePredictor endpoint?
I’d like to create a list of nested sentences using the same tokenizer I will use in a downstream task. How would I go about replacing nli_tokenizer() in the create_nested_sentences() function below, given that I ran the following in my SageMaker notebook instance?
Any advice would be greatly appreciated!
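SageMaker endpoint:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role the endpoint runs under (assumes this runs inside a SageMaker notebook instance)
role = sagemaker.get_execution_role()

# Hub model configuration, passed to the container as environment variables
hub = {
    'HF_MODEL_ID': 'facebook/bart-large-mnli',
    'HF_TASK': 'zero-shot-classification'
}

bart = HuggingFaceModel(
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    env=hub,
    role=role,
)

predictor = bart.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")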
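For context, here is a minimal sketch of the request body I would expect to send to this endpoint for zero-shot classification. It assumes the endpoint accepts the usual JSON shape with an `inputs` string and a `parameters.candidate_labels` list. `build_zero_shot_payload` is a hypothetical helper (not part of the SageMaker SDK) and the labels are made up. The text goes over the wire as a raw string, so no tokenizer is involved on the client side of this call.

```python
import json


def build_zero_shot_payload(text, candidate_labels):
    """Build a JSON-serializable request body for a zero-shot-classification endpoint.

    Hypothetical helper: the payload shape is an assumption about what the
    deployed container expects, not something taken from the endpoint itself.
    """
    return {
        "inputs": text,
        "parameters": {"candidate_labels": list(candidate_labels)},
    }


payload = build_zero_shot_payload(
    "The Eiffel Tower is the tallest structure in Paris.",
    ["architecture", "cooking"],  # illustrative labels only
)
print(json.dumps(payload, indent=2))

# With a live endpoint this would presumably be sent as:
# result = predictor.predict(payload)
```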
Excerpt I’d like to replicate in SageMaker
The code below currently works in Google Colab. I’d like to replicate it using my endpoint, but I am not sure how to access nli_tokenizer from the predictor above.
!pip install transformers
from transformers import AutoTokenizer
nli_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

import spacy
nlp = spacy.load('en_core_web_sm')

def create_nested_sentences(document: str, token_max_length=1024):
    # Reference: https://discuss.huggingface.co/t/summarization-on-long-documents/920/7
    nested = []
    sent = []
    length = 0
    # Break up text into sentences
    tokens = nlp(document)
    for sentence in tokens.sents:
        # Use the same tokenizer as a downstream inference
        tokens_in_sentence = nli_tokenizer(str(sentence), truncation=True, padding=False)['input_ids']
        length += len(tokens_in_sentence)
        if length < token_max_length:
            sent.append(sentence)
        else:
            # Close the current group (skip it if empty), then start the next
            # group with this sentence so it is not dropped
            if sent:
                nested.append(sent)
            sent = [sentence]
            length = len(tokens_in_sentence)
    if sent:
        nested.append(sent)
    return nested
document = '''
The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side.
During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930.
It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft).
Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.
'''
create_nested_sentences(document, token_max_length = 100)
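The grouping step itself does not depend on where the tokenizer comes from, so here is a minimal sketch of the same greedy packing with the token counter passed in as a parameter. `pack_sentences`, `naive_split`, and `whitespace_count` are hypothetical names, and the regex splitter and whitespace counter are stand-ins for spaCy and `nli_tokenizer` so the block runs without any downloads.

```python
import re
from typing import Callable, List


def pack_sentences(sentences: List[str],
                   count_tokens: Callable[[str], int],
                   token_max_length: int = 1024) -> List[List[str]]:
    """Greedily group sentences so each group stays within token_max_length.

    A sentence that is longer than the budget on its own still gets a
    group to itself instead of being dropped.
    """
    nested, group, length = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current group if adding this sentence would exceed the budget
        if group and length + n > token_max_length:
            nested.append(group)
            group, length = [], 0
        group.append(sentence)
        length += n
    if group:
        nested.append(group)
    return nested


def naive_split(text: str) -> List[str]:
    """Stand-in sentence splitter (assumption: spaCy would be used in practice)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def whitespace_count(sentence: str) -> int:
    """Stand-in token counter (assumption: the real tokenizer would be used)."""
    return len(sentence.split())


text = "One two three. Four five six seven. Eight nine. Ten."
groups = pack_sentences(naive_split(text), whitespace_count, token_max_length=6)
print(groups)
```

With the real tokenizer loaded locally, `count_tokens` could be something like `lambda s: len(nli_tokenizer(s, add_special_tokens=False)["input_ids"])`. Counting without special tokens avoids adding the start and end markers once per sentence, though the final joined chunk would still need room for them.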