Slow inference using most recent Docker image

Hey @philschmid - yes that’s right, we’ve been using a custom inference.py script. The only thing in it is a custom model_fn that returns all scores, rather than just the top category, for our categorisation task:

import os
from sagemaker_huggingface_inference_toolkit import transformers_utils

GPU_ID = 0
GPU_NOT_AVAILABLE_ID = -1

def model_fn(model_dir):
    """
    The Load handler is responsible for loading the Hugging Face transformer model.
    It can be overridden to load the model from storage
    Returns:
        hf_pipeline (Pipeline): A Hugging Face Transformer pipeline.
    """
    config_file = "config.json"
    if config_file not in os.listdir(model_dir):
        raise ValueError(f"{config_file} not found", 403)

    task = transformers_utils.infer_task_from_model_architecture(f"{model_dir}/{config_file}")
    device_id = GPU_ID if transformers_utils._is_gpu_available() else GPU_NOT_AVAILABLE_ID

    hf_pipeline = transformers_utils.get_pipeline(
        task=task,
        model_dir=model_dir,
        device=device_id,
        return_all_scores=True,
    )

    return hf_pipeline
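
Just for context on why we override model_fn: with return_all_scores=True a single input comes back with a score per label rather than only the argmax, so the response looks roughly like this (label names and scores below are purely illustrative):

# Illustrative output of the pipeline for one input with return_all_scores=True;
# the real label names come from our fine-tuned model's config.
[
    [
        {"label": "LABEL_0", "score": 0.0123},
        {"label": "LABEL_1", "score": 0.8741},
        # ... one entry per label, 60 in our case
    ]
]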

But we’ve also tried the “zero-code” deployment (no custom inference.py at all), and inference is just as slow with transformers 4.12.3.
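
For reference, the zero-code deployment we tested looks roughly like the sketch below. The S3 path, IAM role and instance type are placeholders, and the pytorch/py versions are just the pairing I’d expect for transformers 4.12.3, so treat those values as assumptions rather than our exact setup:

from sagemaker.huggingface import HuggingFaceModel

# Sketch of the "zero-code" deployment: no entry_point / inference.py,
# the toolkit infers the task from the model's config.json.
huggingface_model = HuggingFaceModel(
    model_data="s3://our-bucket/model.tar.gz",  # placeholder path to the fine-tuned model archive
    role="arn:aws:iam::111122223333:role/sagemaker-execution-role",  # placeholder execution role
    transformers_version="4.12.3",
    pytorch_version="1.9.1",  # assumed pairing for this transformers version
    py_version="py38",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # placeholder instance type
)

# Single request against the endpoint
predictor.predict({"inputs": "example text to classify"})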

The task is sequence classification; in this case there are 60 possible labels. The model is loaded as:

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=60)

and then we fine-tune on our labelled dataset, roughly as sketched below.
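
The dataset variables and hyperparameters here are placeholders (the real training data can’t be shared), but the shape of the training run is:

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# tokenizer is used to build train_dataset / eval_dataset (tokenisation step omitted here)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=60
)

# train_dataset / eval_dataset stand in for our tokenised, labelled dataset
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,              # placeholder hyperparameters
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
trainer.save_model("./model")  # this directory is packaged as model.tar.gz for SageMaker

Let me know if you require any more information, thank you. Owen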