Workers crashing in HF Inferentia inference

Hey everyone,

I’m currently experimenting with Inferentia chips on AWS, using SageMaker real-time inference endpoints, and I’m running into issues with how workers are spawned on the instance.

I have a model based on XLM-RoBERTa-base and I’m using Inf1.xlarge instances with the following code to compile the model:

import os
import tensorflow  # imported before torch to work around a protobuf version conflict
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CORES = 4

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('./model-folder/')
model = AutoModelForSequenceClassification.from_pretrained('./model-folder/', torchscript=True)

# create dummy input for max length 512
dummy_input = "dummy input which will be padded later"
max_length = 512
embeddings = tokenizer(dummy_input, max_length=max_length, padding="max_length", return_tensors="pt")
neuron_inputs = tuple(embeddings.values())

# compile model with torch.neuron.trace and update config
model_neuron = torch.neuron.trace(model, neuron_inputs,
                                  compiler_args=['--neuroncore-pipeline-cores', str(NUM_CORES)],
                                  verbose=1)
model.config.update({"traced_sequence_length": max_length})

# save tokenizer, neuron model and config for later use
save_dir = "tmp"
os.makedirs(save_dir, exist_ok=True)
model_neuron.save(os.path.join(save_dir, "neuron_model.pt"))
tokenizer.save_pretrained(save_dir)
model.config.save_pretrained(save_dir)

and the following inference code (inference.py):

import os
from transformers import AutoConfig, AutoTokenizer
import torch
import torch.neuron

# the model is compiled to pipeline across 4 NeuronCores, so each worker needs all 4
os.environ["NEURON_RT_NUM_CORES"] = "4"

# saved weights name
AWS_NEURON_TRACED_WEIGHTS_NAME = "neuron_model.pt"

def model_fn(model_dir):
    # load tokenizer and neuron model from model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(model_dir)

    return model, tokenizer, model_config


def predict_fn(data, model_tokenizer_model_config):
    model, tokenizer, model_config = model_tokenizer_model_config

    inputs = data.pop("inputs", data)
    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=model_config.traced_sequence_length,
        padding="max_length",
        truncation=True,
    )
    # convert to tuple for neuron model
    neuron_inputs = tuple(embeddings.values())

    with torch.no_grad():
        predictions = model(*neuron_inputs)[0]
        scores = torch.nn.Softmax(dim=1)(predictions)

    # return a list of dicts, which is JSON serializable
    return [{"label": model_config.id2label[item.argmax().item()],
             "score": item.max().item()} for item in scores]

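A quick smoke test of these two handlers on an inf1 machine, using the tmp folder produced by the compile script as model_dir, would look roughly like this (the example sentence is just a placeholder):

# rough local check of model_fn / predict_fn before packaging the model
artifacts = model_fn("tmp")
print(predict_fn({"inputs": "This is a test sentence."}, artifacts))
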
As soon as the instance comes up, the BatchAggregator seems to die:

com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.

I believe it tries to spin up 4 workers, as I see 4 log lines like the following (with different IDs):

W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model model loaded io_fd=b25073fffe1fd064-00000014-00000003-e6d88c0435246fc4-b7d161f3

Calling the endpoint sometimes works, but most of the time it results in the following error:

ERROR NRT:nrt_allocate_neuron_cores NeuronCore(s) not available - Requested:4 Available:0

I assumed the number of workers would be set automatically so that all NeuronCores are used (4 in this case); instead, it seems to spawn 4 workers even though a single worker already claims all 4 available cores. Am I misunderstanding something?

When I try to switch to using a single NeuronCore per worker, I get memory allocation errors instead, which is probably expected because 8 GB of instance memory is not enough to fit 4 copies of XLM-RoBERTa-base.
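
For concreteness, the single-core setup would be roughly the following sketch (reusing model, neuron_inputs and save_dir from the compile script above):

# compile for a single NeuronCore (the default when --neuroncore-pipeline-cores is omitted)
model_neuron = torch.neuron.trace(model, neuron_inputs)
model_neuron.save(os.path.join(save_dir, "neuron_model.pt"))

# and in inference.py, give each worker exactly one core
os.environ["NEURON_RT_NUM_CORES"] = "1"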

Any pointers to resolve these issues would be helpful. Thank you!

Could you share how you deployed the model? Also, could you share the whole stack trace?

Hi @philschmid,

I had deployed the model using the SageMaker UI (more on that later), and the stack trace was:

2022-09-07T12:05:59,337 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
2022-09-07T12:05:59,337 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2022-09-07T12:05:59,337 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.7/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 219, in handle
2022-09-07T12:05:59,338 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.initialize(context)
2022-09-07T12:05:59,338 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.7/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 77, in initialize
2022-09-07T12:05:59,338 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.model = self.load(self.model_dir)
2022-09-07T12:05:59,338 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/.sagemaker/mms/models/model/code/inference.py", line 15, in model_fn
2022-09-07T12:05:59,338 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
2022-09-07T12:05:59,338 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.7/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
2022-09-07T12:05:59,338 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     script_module = jit_load(*args, **kwargs)
2022-09-07T12:05:59,338 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.7/site-packages/torch/jit/_serialization.py", line 161, in load
2022-09-07T12:05:59,339 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
2022-09-07T12:05:59,343 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - RuntimeError: [enforce fail at CPUAllocator.cpp:68] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 142607314 bytes. Error code 12 (Cannot allocate memory)
2022-09-07T12:05:59,344 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 
2022-09-07T12:05:59,344 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:
2022-09-07T12:05:59,344 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 
2022-09-07T12:05:59,344 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2022-09-07T12:05:59,344 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.7/site-packages/mms/service.py", line 108, in predict
2022-09-07T12:05:59,344 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
2022-09-07T12:05:59,344 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.7/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 243, in handle
2022-09-07T12:05:59,345 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise PredictionException(str(e), 400)
2022-09-07T12:05:59,345 [INFO ] W-model-4-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: [enforce fail at CPUAllocator.cpp:68] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 142607314 bytes. Error code 12 (Cannot allocate memory) : 400

However, I then realized that by deploying through the SageMaker SDK, one can reduce the number of workers from the default of 4.

The default number of workers is 4, and I assume each worker loads its own copy of the model into memory, which pushes the memory usage over the 8 GB available on an Inf1.xlarge instance. The Neuron-traced model is 874 MB, so with the extra per-worker overhead of the libraries, this goes over the limit.

I subsequently compiled the model to use 2 NeuronCores and deployed with only 2 workers. This seems to work!
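
For anyone hitting the same issue, the SDK deployment I ended up with looks roughly like this; the model data path, role and framework versions are placeholders, and SAGEMAKER_MODEL_SERVER_WORKERS is, to my understanding, the environment variable the inference toolkit reads for the worker count:

from sagemaker.huggingface.model import HuggingFaceModel

# sketch of the deployment; bucket, role and versions are placeholders
huggingface_model = HuggingFaceModel(
    model_data="s3://<bucket>/model.tar.gz",  # archive with the traced model, tokenizer, config and code/inference.py
    role="<sagemaker-execution-role>",
    transformers_version="4.12",
    pytorch_version="1.9",
    py_version="py37",
    env={"SAGEMAKER_MODEL_SERVER_WORKERS": "2"},  # limit the model server to 2 workers
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf1.xlarge",
)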

I have two questions concerning this:

  • Is there any fairly straightforward way to shrink the model further, so that 4 copies fit on an Inf1.xlarge, without going into distillation or pruning?
  • Is there a way to pack multiple different models into a single Inferentia endpoint, so that one input can get multiple predictions simultaneously?

Thanks a lot!

Best,
Vil

Great to hear that you managed to get it to work. Regarding your other questions:

Is there any fairly straightforward way to shrink the model further, so that 4 copies fit on an Inf1.xlarge, without going into distillation or pruning?

Inferentia currently does not support int8, so the only way to shrink the “model” is to go with a different, smaller model, e.g. distilroberta…

Is there a way to pack multiple different models into a single Inferentia endpoint, so that one input can get multiple predictions simultaneously?

Yes, that should be doable! You just need to add your models to the model.tar.gz, adjust the inference.py to make sure you are loading them correctly, and then run the inference the way you want.
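
A minimal sketch of such an inference.py, assuming two traced models packed into the same model.tar.gz and sharing one tokenizer and config (the file names are hypothetical):

import os
import torch
import torch.neuron
from transformers import AutoConfig, AutoTokenizer

def model_fn(model_dir):
    # load the shared tokenizer/config and both traced models from the archive
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    config = AutoConfig.from_pretrained(model_dir)
    models = {
        "model_a": torch.jit.load(os.path.join(model_dir, "neuron_model_a.pt")),  # hypothetical file name
        "model_b": torch.jit.load(os.path.join(model_dir, "neuron_model_b.pt")),  # hypothetical file name
    }
    return models, tokenizer, config

def predict_fn(data, models_tokenizer_config):
    models, tokenizer, config = models_tokenizer_config

    inputs = data.pop("inputs", data)
    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=config.traced_sequence_length,
        padding="max_length",
        truncation=True,
    )
    neuron_inputs = tuple(embeddings.values())

    # run every model on the same tokenized input and return all predictions
    results = {}
    with torch.no_grad():
        for name, model in models.items():
            logits = model(*neuron_inputs)[0]
            scores = torch.nn.functional.softmax(logits, dim=1)
            results[name] = [
                {"label": config.id2label[row.argmax().item()], "score": row.max().item()}
                for row in scores
            ]
    return results

If the models need different tokenizers or sequence lengths, you would tokenize separately per model inside the loop instead of sharing one set of embeddings.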
