Adding ID to Text Output in AWS Batch Transform Job with DistilBERT Model

nanaiv · September 29, 2023, 10:01am

I have a dataset in JSON format where each record has ‘id’ and ‘text’ fields. Currently, I’m using the following configuration for a batch transform job in AWS:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Hub model configuration
hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased',
    'HF_TASK': 'feature-extraction'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub,                      # configuration for loading model from Hub
    role=role,                    # IAM role used by the model and the batch job
    transformers_version="4.26",  # Transformers version used
    pytorch_version="1.13",       # PyTorch version used
    py_version='py39',            # Python version used
)

# create Transformer to run our batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=output_s3_path,   # output is written to the same S3 path as the input
    strategy='SingleRecord',
)

I’m using a batch transform job to generate the output, which currently contains only the extracted text. However, I also want to include the ‘id’ associated with each text in the output file. Is there a way to achieve this, and if so, how can I modify my configuration to include the ‘id’ in the output file? Any guidance or examples would be greatly appreciated!
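One possible approach, sketched here under stated assumptions and not tested against this exact setup, is to let SageMaker join each prediction back onto its input record. Batch transform documents a `join_source="Input"` option on `transform()` which, for JSON input, returns the input object with the model output added under a `SageMakerOutput` key. The sketch assumes the input is rewritten as JSON Lines with the text under `inputs` (the field the Hugging Face inference toolkit reads) and that the extra `id` field is ignored by the default handler; that last point is an assumption worth checking on a small sample. The helper names `to_jsonl` and `parse_joined_output` are hypothetical.

```python
import json


def to_jsonl(records):
    """Convert [{'id': ..., 'text': ...}, ...] to JSON Lines.

    The text is moved to 'inputs'; 'id' is kept so it can be joined
    back onto the output (assumes the handler ignores extra keys).
    """
    return "\n".join(
        json.dumps({"id": r["id"], "inputs": r["text"]}) for r in records
    )


# Extra keyword argument for huggingface_model.transformer(...):
# write one joined JSON object per output line.
TRANSFORMER_KWARGS = {"assemble_with": "Line"}

# Keyword arguments for batch_job.transform(...):
# split the input per line and join each prediction with its input record.
TRANSFORM_KWARGS = {
    "content_type": "application/json",
    "split_type": "Line",
    "join_source": "Input",
}


def parse_joined_output(jsonl_text):
    """Map each id to its model output from a joined output file."""
    result = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            result[record["id"]] = record["SageMakerOutput"]
    return result
```

With these in place, the job would be started with something like `batch_job.transform(data=input_s3_path, **TRANSFORM_KWARGS)`, where `input_s3_path` is a placeholder for the S3 URI of the uploaded JSON Lines file and `assemble_with='Line'` has been passed when creating the transformer. If the default handler turns out to reject the extra `id` field, the documented `input_filter` JSONPath option or a custom `inference.py` would be the alternatives to look at.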
