Hey @philschmid - yes that’s right, we’ve been using a custom inference.py script. The only thing we have in there is a custom model_fn that returns all scores, instead of just the top category, for our categorisation task:
import os

from sagemaker_huggingface_inference_toolkit import transformers_utils

GPU_ID = 0
GPU_NOT_AVAILABLE_ID = -1


def model_fn(model_dir):
    """
    The Load handler is responsible for loading the Hugging Face transformer model.
    It can be overridden to load the model from storage.

    Returns:
        hf_pipeline (Pipeline): A Hugging Face Transformer pipeline.
    """
    config_file = "config.json"
    if config_file not in os.listdir(model_dir):
        raise ValueError(f"{config_file} not found", 403)

    task = transformers_utils.infer_task_from_model_architecture(f"{model_dir}/{config_file}")
    device_id = GPU_ID if transformers_utils._is_gpu_available() else GPU_NOT_AVAILABLE_ID

    hf_pipeline = transformers_utils.get_pipeline(
        task=task,
        model_dir=model_dir,
        device=device_id,
        return_all_scores=True,
    )
    return hf_pipeline
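Just to illustrate what we rely on from that flag: return_all_scores=True makes the pipeline return one score per label rather than only the top prediction, along these lines (the model name and input below are placeholders standing in for our fine-tuned model, not our actual setup):

from transformers import pipeline

# Quick local illustration of the output shape we depend on; the base model
# here is only a stand-in for our fine-tuned model_dir.
clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased",
    return_all_scores=True,
)
print(clf("example input text"))
# prints a list of {'label': ..., 'score': ...} dicts, one entry per label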
But we’ve also tried the “zero-code” deployment, and inference is just as slow with transformers 4.12.3.
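For reference, the zero-code deployment we tried looks roughly like this (the S3 path, IAM role, instance type and the non-transformers DLC versions below are placeholders rather than our exact values):

from sagemaker.huggingface import HuggingFaceModel

# Deploy the fine-tuned model tarball without any custom inference.py.
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder path to the model artefact
    role="my-sagemaker-execution-role",        # placeholder IAM role
    transformers_version="4.12.3",
    pytorch_version="1.9.1",                   # placeholder DLC version pairing
    py_version="py38",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",            # placeholder instance type
)

print(predictor.predict({"inputs": "example input text"}))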
The task is sequence classification; in this case there are 60 possible target labels. The model is loaded as:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=60)
and then we fine-tune on our labelled dataset, roughly as sketched below.
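The fine-tuning itself is a fairly standard Trainer run along these lines (the dummy dataset and hyperparameters below are placeholders, not our actual data or settings):

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=60
)

# Placeholder examples standing in for our labelled categorisation dataset.
raw = Dataset.from_dict({"text": ["example one", "example two"], "labels": [0, 1]})
train_dataset = raw.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

# Placeholder hyperparameters.
training_args = TrainingArguments(output_dir="./results", num_train_epochs=3)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()

Let me know if you require any more information, thank you. Owen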