I have a dataset in JSON format with ‘id’ and ‘text’ columns. Currently, I’m using the following pipeline configuration in AWS:
from sagemaker.huggingface import HuggingFaceModel

# Hub model configuration
hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased',
    'HF_TASK': 'feature-extraction'
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub,                      # configuration for loading model from Hub
    role=role,                    # IAM role with permissions to create an endpoint
    transformers_version="4.26",  # Transformers version used
    pytorch_version="1.13",       # PyTorch version used
    py_version='py39',            # Python version used
)
# create Transformer to run our batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=output_s3_path,   # save the output to the same S3 path as the input
    strategy='SingleRecord')
I’m using a batch transform job to generate the output, which currently contains only the extracted text. However, I also want to include the ‘id’ associated with each text in the output file. Is there a way to achieve this, and if so, how can I modify my configuration to include the ‘id’ in the output file? Any guidance or examples would be greatly appreciated!
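For reference, here is a minimal sketch of two directions that might work. It is not a confirmed solution. The first relies on the `input_filter`, `join_source`, and `output_filter` arguments of `Transformer.transform()` in the SageMaker Python SDK, which are meant to associate predictions with input records. The second pairs input and output lines by position after the job finishes, assuming the output file keeps the input order. The names `payload`, `JOIN_KWARGS`, `to_json_lines`, `rejoin_by_position`, and `input_s3_path` are hypothetical. The JSONPath values are assumptions about how the records would be laid out, and whether the Hugging Face container accepts the filtered payload as-is is untested here.

```python
import json

# Assumed data-processing arguments for Transformer.transform().
# The JSONPath expressions depend on the (hypothetical) record layout below.
JOIN_KWARGS = {
    "content_type": "application/json",
    "split_type": "Line",           # one JSON record per line (JSON Lines)
    "input_filter": "$.payload",    # assumption: send only this part to the model
    "join_source": "Input",         # merge each input record with its prediction
    "output_filter": "$['id','SageMakerOutput']",  # keep the id and the model output
}


def to_json_lines(records):
    """Turn [{'id': ..., 'text': ...}, ...] into JSON Lines.

    The text is nested under a hypothetical 'payload' key in the
    {"inputs": ...} shape that the Hugging Face inference toolkit expects.
    """
    return "\n".join(
        json.dumps({"id": r["id"], "payload": {"inputs": r["text"]}})
        for r in records
    )


def rejoin_by_position(input_jsonl, output_jsonl):
    """Fallback: pair each input id with the output line at the same position.

    Assumes the output was assembled line by line in input order.
    """
    inputs = [json.loads(line) for line in input_jsonl.splitlines() if line.strip()]
    outputs = [json.loads(line) for line in output_jsonl.splitlines() if line.strip()]
    if len(inputs) != len(outputs):
        raise ValueError("input and output line counts differ")
    return [{"id": i["id"], "output": o} for i, o in zip(inputs, outputs)]


# Possible usage with the batch job from the question (not executed here).
# Joining JSON records likely also needs accept='application/json' and
# assemble_with='Line' when calling huggingface_model.transformer(...).
# batch_job.transform(data=input_s3_path, **JOIN_KWARGS)
```

If the join works as assumed, each output line would carry the original `id` next to a `SageMakerOutput` field holding the extracted features. If it does not, `rejoin_by_position` can be run locally on the downloaded input and output files.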