Cache is too large: 300 GB data -> 3 TB cache

Dear Team,

I use map to process my data, and my 300 GB dataset becomes a 3 TB cache, which runs out of my device's storage.

Possible solutions:

  1. I understand we can use set_transform to process data on the fly. But how can I do remove_columns after using set_transform? Currently I call remove_columns after map.

May I know if you have a good solution for this? Thank you!

Solved with set_transform, removing the columns inside the preprocess_function.


@yiwc - It would be of great help if you could add sample code showing how you solved this.