Hi,
I have a very large dataset that fails at the preprocessing stage due to its size.
I’ve looked into two options: sharding the dataset vs. using batched=True with the map(…) function.
What are the advantages/use cases of each approach? Also, can anyone point me to an example that uses each one?
Shards - The official docs don't show how to do a full training run using shards (e.g., do you train for one epoch on a shard and then switch to the next shard, as in the sketch below?).
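To make the question concrete, here's roughly the training scheme I'm imagining (just a sketch; "imdb", num_shards, and the inner loop body are placeholders I made up, not anything from the docs):

```python
from datasets import load_dataset

# "imdb" stands in for my actual (much larger) dataset
dataset = load_dataset("imdb", split="train")

num_shards = 8  # made-up number, purely for illustration

for epoch in range(3):
    for shard_idx in range(num_shards):
        # Dataset.shard() selects the shard_idx-th of num_shards pieces
        shard = dataset.shard(num_shards=num_shards, index=shard_idx)
        for example in shard:
            ...  # placeholder for preprocessing + one training step
```

Is that the intended usage, or is there a recommended pattern somewhere?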
Batched - I'm not sure what this error means:
Provided function which is applied to all elements of table returns a dict of types [<class 'list'>, <class 'list'>, <class 'str'>]. When using batched=True, make sure provided function returns a dict of types like (<class 'list'>, <class 'numpy.ndarray'>)
Any clear example showing batched=True would help!
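For reference, here is my current understanding of the batched=True contract (a minimal sketch; "imdb" and the "text" column are stand-ins for my own data):

```python
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")

def preprocess(batch):
    # With batched=True, batch["text"] is a list of strings.
    # Every value in the returned dict must be a list (or array) with
    # one entry per example -- returning a bare str for a column is,
    # I think, what triggers the error quoted above.
    return {"text_length": [len(text) for text in batch["text"]]}

dataset = dataset.map(preprocess, batched=True, batch_size=1000)
```

Is that reading correct, i.e. the error just means one of my returned columns was a plain string instead of a list?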