Hey,
batch transform offers something called join_source, where you can join input and output files.
- join_source ( str ) – The source of data to be joined to the transform output. It can be set to ‘Input’ meaning the entire input record will be joined to the inference result. You can use OutputFilter to select the useful portion before uploading to S3. (default: None). Valid values: Input, None.
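As a rough sketch, the keyword arguments for the SageMaker Python SDK's `Transformer.transform()` call could look like the following. The helper name `build_transform_kwargs`, the S3 URI, and the `id` field in the filter are made up for illustration, and whether this join behaves as expected with the HuggingFace container is untested here.

```python
def build_transform_kwargs(s3_input_uri: str) -> dict:
    """Collect transform() arguments that join each input record to its prediction.

    Hypothetical helper: only the parameter names come from the SageMaker SDK.
    """
    return {
        "data": s3_input_uri,
        "content_type": "application/json",
        "split_type": "Line",    # treat every line of the file as one record
        "join_source": "Input",  # attach the whole input record to the result
        # Optional JSONPath filter applied before upload to S3. "id" is an
        # assumed field of the input records; for JSON input the prediction
        # is placed under "SageMakerOutput".
        "output_filter": "$['id','SageMakerOutput']",
    }


kwargs = build_transform_kwargs("s3://my-bucket/batch/input.jsonl")
# With an existing sagemaker.transformer.Transformer instance:
# transformer.transform(**kwargs)
```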
But I am not sure if this works with the jsonl and json structure we need for HuggingFace. But you can find more about it here: Associate Prediction Results with Input Records - Amazon SageMaker
The easiest option might be to write a custom Python function that post-processes and merges your data files after the batch transform job has finished. If you use SageMaker Pipelines, you could use a Lambda step for this.
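A minimal sketch of such a merge function, assuming the input is a jsonl file with one JSON object per line and the batch transform output contains exactly one JSON prediction per line in the same order. The function name `merge_jsonl` and the `prediction` key are made up for illustration; adjust them to your actual file layout.

```python
import json
from typing import Any, Dict, Iterable, List


def merge_jsonl(
    input_lines: Iterable[str],
    output_lines: Iterable[str],
    key: str = "prediction",
) -> List[Dict[str, Any]]:
    """Attach the i-th prediction to the i-th input record.

    Assumes both sides are line-delimited JSON in the same order.
    """
    # Skip blank lines, e.g. a trailing newline at the end of a file.
    inputs = [json.loads(line) for line in input_lines if line.strip()]
    outputs = [json.loads(line) for line in output_lines if line.strip()]
    if len(inputs) != len(outputs):
        raise ValueError(
            f"{len(inputs)} input records but {len(outputs)} predictions"
        )
    # Copy each input record and store its prediction under `key`.
    return [{**record, key: pred} for record, pred in zip(inputs, outputs)]


# Example with in-memory lines; in practice you would read the downloaded
# input file and the matching output file from the transform job.
example_inputs = ['{"id": 1, "inputs": "good"}', '{"id": 2, "inputs": "bad"}']
example_outputs = ['{"label": "POSITIVE"}', '{"label": "NEGATIVE"}']
merged = merge_jsonl(example_inputs, example_outputs)
```

The same function could be called from a Lambda handler after downloading both files from S3.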