I want to understand how I/O happens when I tokenize a dataset with dataset.map().
To analyze its I/O pattern, I traced all system calls made during dataset.map with strace.
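For context, the kind of call I was tracing looks roughly like this (a minimal sketch; the exact dataset config, tokenizer, and column name are placeholders, not my real script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset and tokenizer; my actual run used a Wikipedia dump.
dataset = load_dataset("wikipedia", "20220301.en", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Tokenize the raw text column; the results end up in a new cache file.
    return tokenizer(batch["text"], truncation=True)

# This is the call I ran under strace.
tokenized = dataset.map(tokenize, batched=True)
```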
I found that the processed data was written to a cache file, but I could not see any open or read system calls for the dataset file itself (in my case, Wikipedia).
To my understanding, at least one open and one read system call should appear.
I found that it seems to be related to RecordBatchReader in PyArrow/Arrow, but I could not work out the actual I/O flow inside that module.
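My current (unverified) guess, based on reading the PyArrow docs, is that the cached Arrow file is opened through a memory map and then consumed batch by batch, roughly like the sketch below. The file path is hypothetical, and this is just how I understand the PyArrow API, not the actual code path inside datasets:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Hypothetical path to one of the cached Arrow files under ~/.cache/huggingface/datasets.
path = "wikipedia-train.arrow"

# memory_map() only issues open/mmap; later accesses are served by page faults,
# so strace would not show explicit read() calls for this file.
source = pa.memory_map(path, "r")
reader = ipc.open_stream(source)  # a RecordBatchStreamReader over the mapped file

for batch in reader:
    # Each RecordBatch is backed by the mapped region; touching the data
    # pulls pages in lazily instead of going through read().
    print(batch.num_rows)
    break
```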
Can you help me grasp the typical I/O flow during dataset.map?