Access Tokenizer from Sagemaker BART Endpoint

Is there a way to access the tokenizer used in the HuggingFacePredictor endpoint?

I’d like to create a list of nested sentences using the same tokenizer I will use in a downstream task. How would I go about replacing nli_tokenizer() in the create_nested_sentences() function below given that I ran the following in my Sagemaker Notebook instance?

Any advice would be greatly appreciated!

Sagemaker Endpoint:

from sagemaker.huggingface import HuggingFaceModel

hub = {
  'HF_MODEL_ID':'facebook/bart-large-mnli',
  'HF_TASK':'zero-shot-classification'
}

bart = HuggingFaceModel(
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    env=hub,
    role=role,
)

predictor = bart.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
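For context, the deployed zero-shot endpoint is then invoked along these lines (a minimal sketch; the payload format follows the Hugging Face inference toolkit convention for zero-shot-classification, and the candidate labels are just placeholders):

# Hedged example: calling the deployed zero-shot endpoint
result = predictor.predict({
    "inputs": "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building.",
    "parameters": {"candidate_labels": ["architecture", "history", "sports"]}
})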

Excerpt I’d like to replicate in Sagemaker
The code below currently works in Google Colab, and I’d like to replicate it using my endpoint, but I am not sure how to access nli_tokenizer() from the predictor above.

!pip install transformers

from transformers import AutoTokenizer
nli_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

import spacy
nlp = spacy.load('en_core_web_sm')

def create_nested_sentences(document: str, token_max_length=1024):
    # Reference: https://discuss.huggingface.co/t/summarization-on-long-documents/920/7
    nested = []
    sent = []
    length = 0
    # Break up text into sentences
    tokens = nlp(document)

    for sentence in tokens.sents:
        # Use the same tokenizer as the downstream inference
        tokens_in_sentence = nli_tokenizer(str(sentence), truncation=True, padding=False)['input_ids']
        if length + len(tokens_in_sentence) < token_max_length:
            sent.append(sentence)
            length += len(tokens_in_sentence)
        else:
            # Start a new chunk with the current sentence so it is not dropped
            nested.append(sent)
            sent = [sentence]
            length = len(tokens_in_sentence)

    if sent:
        nested.append(sent)
    return nested

document = '''
The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. 
During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930.
It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). 
Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.
'''
create_nested_sentences(document, token_max_length = 100)

Hi @pleonova , do I understand correctly that you want to use your logic for nested sentences inside the Sagemaker endpoint?

Hi @marshmellow77, I originally intended to use the nested sentences outside of the endpoint, as a prep step for my long text. However, I guess I could also use that logic in a custom function inside predict_fn(). I was hoping to use the HuggingFace Sagemaker toolkit implementation off the shelf.

My ideal order of operation:

  1. Convert long text into nested sentences based on the token length (create tokens using AutoTokenizer.from_pretrained('facebook/bart-large-mnli'))
  2. Feed those nested sentences into a conditional generation model (BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn"))
  3. Feed the summary results into a sequence classification model (AutoModelForSequenceClassification.from_pretrained(nli_model_name))

^ I’d like to do the above using Sagemaker Endpoints. I currently have this working on my hugging face app here.

I see. In that case it seems to me that you can override the relevant methods in the SageMaker Hugging Face Inference Toolkit:

  • You can load both models in the model_fn() method
  • You can override input_fn() to pre-process the data
  • You can chain the conditional generation model and the sequence classification model in the predict_fn() method

See this documentation and this example. Hope this helps.
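For what it’s worth, a rough sketch of such an inference.py could look like the code below (the handler signatures follow the inference toolkit docs; the chaining logic and request format are illustrative assumptions, not a tested implementation):

# inference.py -- rough sketch of the override pattern described above
import json
from transformers import pipeline

def model_fn(model_dir):
    # Load both models once when the endpoint container starts
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    return summarizer, classifier

def input_fn(input_data, content_type):
    # Pre-process the request body; the nested-sentence splitting could live here
    return json.loads(input_data)

def predict_fn(data, models):
    # Chain the models: summarize first, then classify the summary
    summarizer, classifier = models
    summary = summarizer(data["inputs"], truncation=True)[0]["summary_text"]
    return classifier(summary, candidate_labels=data["parameters"]["candidate_labels"])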

Cheers
Heiko

Thank you Heiko! I wasn’t sure if I was missing anything in terms of being able to access the tokenizer from the endpoint without chaining the models in the custom predict_fn().