Batch_transform Pipeline?

Hey,

Batch transform offers a parameter called join_source, which lets you join each input record to its inference output.

  • join_source ( str ) – The source of data to be joined to the transform output. It can be set to ‘Input’ meaning the entire input record will be joined to the inference result. You can use OutputFilter to select the useful portion before uploading to S3. (default: None). Valid values: Input, None.

I am not sure whether this works with the JSONL and JSON structure we need for Hugging Face, but you can find more about it here: Associate Prediction Results with Input Records - Amazon SageMaker
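For reference, a join_source request could look roughly like the sketch below. It only collects the keyword arguments for the SageMaker Python SDK's transform() call. The helper name build_transform_kwargs and the S3 URI are made up for illustration, and I have not verified that the joined output has the shape you need with the Hugging Face containers.

```python
def build_transform_kwargs(input_s3_uri: str) -> dict:
    """Collect keyword arguments for the SageMaker SDK's Transformer.transform().

    Hypothetical helper: the name and the chosen values are assumptions,
    not something taken from the SageMaker documentation.
    """
    return {
        "data": input_s3_uri,
        "content_type": "application/json",
        "split_type": "Line",    # treat each line of the JSONL file as one record
        "join_source": "Input",  # join the entire input record to its inference result
        "output_filter": "$",    # JSONPath; "$" keeps the whole joined record
    }


# Assumed usage, given a `transformer` created beforehand with
# assemble_with="Line" and accept="application/json":
#   transformer.transform(**build_transform_kwargs("s3://my-bucket/input/data.jsonl"))
kwargs = build_transform_kwargs("s3://my-bucket/input/data.jsonl")
print(kwargs["join_source"])
```

If only part of the joined record is useful, output_filter can be narrowed to a more specific JSONPath expression before the result is uploaded to S3.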

The easiest option might be to write a custom Python function that post-processes and merges your data files after the batch transform job has finished. If you use SageMaker Pipelines, you could use a Lambda step for this.
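A minimal sketch of such a merge function, assuming both the input file and the batch transform output are JSON Lines, with one JSON object per input line and one result per output line in the same order. The function name merge_jsonl and the "prediction" key are hypothetical, and the real output layout should be checked before relying on this.

```python
import json


def merge_jsonl(input_text: str, output_text: str, key: str = "prediction") -> str:
    """Attach the i-th output record to the i-th input record.

    Assumes each non-empty input line is a JSON object and that the
    output lines are in the same order as the input lines.
    """
    inputs = [json.loads(line) for line in input_text.splitlines() if line.strip()]
    outputs = [json.loads(line) for line in output_text.splitlines() if line.strip()]

    # Refuse to merge if the files do not line up record for record.
    if len(inputs) != len(outputs):
        raise ValueError(
            f"record count mismatch: {len(inputs)} inputs vs {len(outputs)} outputs"
        )

    merged = []
    for record, prediction in zip(inputs, outputs):
        # Copy the input record and add the prediction under `key`.
        merged.append(json.dumps({**record, key: prediction}))

    # Return JSON Lines text, one merged record per line.
    return "".join(line + "\n" for line in merged)


# Made-up sample data, only to show the shape of the call.
sample_input = '{"inputs": "I love this"}\n{"inputs": "I hate this"}\n'
sample_output = '[{"label": "POSITIVE"}]\n[{"label": "NEGATIVE"}]\n'
print(merge_jsonl(sample_input, sample_output))
```

In a Lambda step, the two files would first have to be read from S3 and the merged text written back, which is left out here.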
