Batch Transform Passing Entire Batch at Once

Hi, as the title states, I’m using batch transform. I hadn’t hit any errors with smaller batch sizes, but when I pass a larger set of 1000 texts I get: DefaultCPUAllocator: can't allocate memory: you tried to allocate 12582912000 bytes. Error code 12 (Cannot allocate memory) : 400

I’m using a custom BERT-base model. Here are my predict_fn and input_fn:

import csv
import json
from io import StringIO

import torch

def predict_fn(data, model_tokenizer):
    model, tokenizer = model_tokenizer

    print("LENGTH OF DATA: ", len(data)) # This is outputting 1000, indicating the whole batch was passed at once
    # Every text is padded to 512 tokens and the whole list goes through the model in one forward pass
    encoded_inputs = tokenizer(data, padding='max_length', max_length=512, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(encoded_inputs['input_ids'], encoded_inputs['attention_mask'])
    return {"output": output.tolist()}

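def input_fn(input_data: str, content_type):
    # Return a plain list of texts for both content types, since predict_fn passes it straight to the tokenizer
    if content_type == 'text/csv':
        stream = StringIO(input_data)
        request_list = list(csv.DictReader(stream))
        return [entry['inputs'] for entry in request_list]
    elif content_type == 'application/json':
        # JSON Lines: one object per line; skip blank lines such as a trailing newline
        return [json.loads(line)['inputs'] for line in input_data.splitlines() if line.strip()]
    raise ValueError(f"Unsupported content type: {content_type}")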

And here’s the code I’m using to set up the transform job. I’ve tried setting max_concurrent_transforms manually to many different values, but the problem occurs even with the default.

batch_job = self.model.transformer(
    instance_count=instance_count, # 1
    instance_type=instance_type, # ml.g4dn.xlarge
    output_path=s3_path_join('s3://', self.bucket, 'output'),
    data=s3_path_join('s3://', self.bucket, 'input', f'uncat_{self.model_type}_ads.jsonl'),
    content_type = 'application/json',

Any thoughts? I could split the input into smaller batches manually inside predict_fn, but that seems like a poor workaround.
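
For reference, the manual batching mentioned above can be kept fairly small. A minimal sketch, assuming the model call is wrapped in a callable; `iter_chunks`, `predict_in_chunks`, `run_model` and the chunk size of 32 are hypothetical names and values, not part of the SageMaker inference toolkit:

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(items: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def predict_in_chunks(
    texts: Sequence[str],
    run_model: Callable[[Sequence[str]], List],
    chunk_size: int = 32,  # assumed value; tune to the instance's memory
) -> List:
    """Run run_model on one chunk at a time and concatenate the results in order."""
    outputs: List = []
    for chunk in iter_chunks(texts, chunk_size):
        # run_model is expected to return one result per input text
        outputs.extend(run_model(chunk))
    return outputs
```

Inside predict_fn, run_model would tokenize one chunk and call the model under torch.no_grad(), returning output.tolist(), so only chunk_size padded sequences are held in memory at a time.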

For batch transform, the maximum size of the input data per invocation is 100 MB. This value can’t be adjusted.

See: Service restrictions and quotas - AWS Marketplace or Use Batch Transform - Amazon SageMaker

Hi, thanks for the response. I think I might be misunderstanding. My input of 1000 texts is under 1 MB; it’s only once it reaches predict_fn that it’s embedded and balloons to ~12 GB, according to the error message.
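
The number in the error message can be checked by hand. Assuming the custom model keeps BERT-base's 12 attention heads and runs in float32 (neither is stated in the thread), a single attention-score tensor of shape (batch, heads, seq_len, seq_len) for 1000 texts padded to 512 tokens comes to exactly the 12582912000 bytes reported. `attention_scores_bytes` is a hypothetical helper used only for this arithmetic:

```python
def attention_scores_bytes(
    batch_size: int,
    num_heads: int = 12,      # assumption: standard BERT-base head count
    seq_len: int = 512,       # max_length used in the tokenizer call
    bytes_per_value: int = 4, # assumption: float32
) -> int:
    """Size in bytes of one (batch, heads, seq_len, seq_len) attention-score tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


print(attention_scores_bytes(1000))  # 12582912000, the figure in the error
print(attention_scores_bytes(32))    # 402653184, roughly 0.4 GB for a batch of 32
```

Under those assumptions, memory grows linearly with the number of texts per forward pass, so the size of the uploaded file says little about the peak allocation.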

I thought that MultiRecord/max_concurrent_transforms split those 1000 texts into batches of max_concurrent_transforms records each and then fed those into predict_fn. Is that not correct? My thought was that setting max_concurrent_transforms to 32, for instance, would feed in only 32 texts at once and keep the size down.

EDIT: I think I understand now. To use MultiRecord successfully, I’ll need to either pre-embed so that MaxPayload can correctly assess how much memory the mini-batch takes, or find the limit manually (by embedding the texts myself, measuring the size, and seeing how many fit in 100 MB).
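
As a rough illustration of what MultiRecord does with the payload limit: per the documented BatchStrategy setting, it fits as many records into a mini-batch as MaxPayloadInMB allows, and that limit is measured on the request body, not on the tensors built later in predict_fn. The sketch below is a simplified model of that packing, not SageMaker's actual implementation; `pack_records` is a hypothetical helper, and the 6 MB figure assumes the documented default for MaxPayloadInMB.

```python
from typing import List


def pack_records(lines: List[str], max_payload_bytes: int) -> List[List[str]]:
    """Greedily group lines into mini-batches whose encoded size stays under the limit."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        # Start a new mini-batch when adding this line would exceed the limit
        if current and current_size + size > max_payload_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        batches.append(current)
    return batches


# 1000 short JSON Lines records, as in the question (hypothetical content)
records = ['{"inputs": "some short ad text"}'] * 1000
default_limit = 6 * 1024 * 1024  # assumed default MaxPayloadInMB of 6
print(len(pack_records(records, default_limit)))  # 1
```

With records this small, all 1000 fit in one mini-batch, which is consistent with predict_fn seeing len(data) == 1000. In this simplified model, lowering the payload limit or chunking inside predict_fn changes how many texts arrive per forward pass; max_concurrent_transforms does not.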