Hey everyone,
I’m currently experimenting with AWS Inferentia chips via SageMaker real-time inference endpoints, and I’m running into issues related to the different workers on the instance.
I have a model based on XLM-RoBERTa-base and I’m using inf1.xlarge instances, with the following code to compile the model:
import os
import tensorflow  # imported before torch.neuron to work around a protobuf version conflict (as in the Hugging Face Inferentia example)
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
NUM_CORES = 4  # an inf1.xlarge has a single Inferentia chip with 4 NeuronCores
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('./model-folder/')
model = AutoModelForSequenceClassification.from_pretrained('./model-folder/', torchscript=True)
# create dummy input for max length 512
dummy_input = "dummy input which will be padded later"
max_length = 512
embeddings = tokenizer(dummy_input, max_length=max_length, padding="max_length", return_tensors="pt")
neuron_inputs = tuple(embeddings.values())
# compile model with torch.neuron.trace and update config
model_neuron = torch.neuron.trace(
    model,
    neuron_inputs,
    compiler_args=['--neuroncore-pipeline-cores', str(NUM_CORES)],
    verbose=1,
)
model.config.update({"traced_sequence_length": max_length})
# save tokenizer, neuron model and config for later use
save_dir = "tmp"
os.makedirs("tmp", exist_ok=True)
model_neuron.save(os.path.join(save_dir, "neuron_model.pt"))
tokenizer.save_pretrained(save_dir)
model.config.save_pretrained(save_dir)
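If the compile host itself has Inferentia available, the traced model can be exercised directly on the dummy input before packaging, e.g. something like this (just a sketch, reusing the objects above):
# sketch: compare the traced model's output against the original model on the
# same dummy input (only runs if the Neuron runtime / an Inferentia device is available)
with torch.no_grad():
    reference_logits = model(*neuron_inputs)[0]
    neuron_logits = model_neuron(*neuron_inputs)[0]
print("max abs diff:", torch.max(torch.abs(reference_logits - neuron_logits)).item())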
and the following inference code:
import os
from transformers import AutoConfig, AutoTokenizer
import torch
import torch.neuron
# To use one neuron core per worker
os.environ["NEURON_RT_NUM_CORES"] = "4"
# saved weights name
AWS_NEURON_TRACED_WEIGHTS_NAME = "neuron_model.pt"
def model_fn(model_dir):
    # load tokenizer, traced neuron model and config from model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(model_dir)
    return model, tokenizer, model_config
def predict_fn(data, model_tokenizer_model_config):
    # unpack the objects returned by model_fn
    model, tokenizer, model_config = model_tokenizer_model_config
    inputs = data.pop("inputs", data)
    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=model_config.traced_sequence_length,
        padding="max_length",
        truncation=True,
    )
    # convert to tuple for neuron model
    neuron_inputs = tuple(embeddings.values())
    with torch.no_grad():
        predictions = model(*neuron_inputs)[0]
        scores = torch.nn.Softmax(dim=1)(predictions)
    # return a dictionary, which will be JSON serializable
    return [{"label": model_config.id2label[item.argmax().item()],
             "score": item.max().item()} for item in scores]
As soon as the instance comes up, the BatchAggregator seems to die:
com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
I believe it tries to spin up 4 workers, as I see 4 log lines like the following (with different IDs):
W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model model loaded io_fd=b25073fffe1fd064-00000014-00000003-e6d88c0435246fc4-b7d161f3
Calling the endpoint sometimes works, but most of the time it results in the following error:
ERROR NRT:nrt_allocate_neuron_cores NeuronCore(s) not available - Requested:4 Available:0
I assumed the number of workers would be set automatically so that all NeuronCores are used (4 in this case); however, the server seems to spawn 4 workers even though a single worker already claims all 4 available cores, since the model was compiled with --neuroncore-pipeline-cores 4. Am I misunderstanding something?
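Is the expectation that I cap the worker count myself when creating the endpoint? Something like the following is the only knob I can think of (just a sketch: the bucket, role and framework versions are placeholders, and I’m not sure SAGEMAKER_MODEL_SERVER_WORKERS is even the right setting here):
from sagemaker.huggingface import HuggingFaceModel

# sketch: cap the model server at a single worker so it doesn't start one worker per core;
# model_data, role and the framework versions below are placeholders
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/neuron/model.tar.gz",  # placeholder
    role="my-sagemaker-execution-role",               # placeholder
    transformers_version="4.12",
    pytorch_version="1.9",
    py_version="py37",
    env={"SAGEMAKER_MODEL_SERVER_WORKERS": "1"},
)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf1.xlarge",
)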
When I instead try to switch to using a single NeuronCore, I get memory allocation errors, which is probably expected since 8 GB is not enough to fit 4 copies of XLM-RoBERTa-base.
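For clarity, by switching to a single NeuronCore I mean roughly this variant (sketch, reusing the objects from the compile and inference scripts above):
# 1) compile for a single NeuronCore, i.e. drop the pipeline-cores compiler flag
model_neuron = torch.neuron.trace(model, neuron_inputs)
# 2) in the inference script, let each worker claim only one core
os.environ["NEURON_RT_NUM_CORES"] = "1"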
Any pointers to resolve these issues would be helpful. Thank you!