I want to understand how I/O happens when I tokenize a dataset with dataset.map().
To analyze its I/O pattern, I traced all system calls made during dataset.map with strace.
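For context, the kind of call I was tracing looks roughly like this (a minimal sketch; the exact dataset config, tokenizer, and column name are placeholders, not my real script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset and tokenizer; my actual run used a Wikipedia dump.
dataset = load_dataset("wikipedia", "20220301.en", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Tokenize the raw text column; the results end up in a new cache file.
    return tokenizer(batch["text"], truncation=True)

# This is the call I ran under strace.
tokenized = dataset.map(tokenize, batched=True)
```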
I found that the processed data was written to a cache file, but I could not see any open or read system calls for the dataset file itself (in my case, Wikipedia).
To my understanding, at least one open and one read system call should appear.
I found that it seems to be related to RecordBatchReader in PyArrow/Arrow, but I could not work out the actual I/O flow inside that module.
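My current (unverified) guess, based on reading the PyArrow docs, is that the cached Arrow file is opened through a memory map and then consumed batch by batch, roughly like the sketch below. The file path is hypothetical, and this is just how I understand the PyArrow API, not the actual code path inside datasets:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Hypothetical path to one of the cached Arrow files under ~/.cache/huggingface/datasets.
path = "wikipedia-train.arrow"

# memory_map() only issues open/mmap; later accesses are served by page faults,
# so strace would not show explicit read() calls for this file.
source = pa.memory_map(path, "r")
reader = ipc.open_stream(source)  # a RecordBatchStreamReader over the mapped file

for batch in reader:
    # Each RecordBatch is backed by the mapped region; touching the data
    # pulls pages in lazily instead of going through read().
    print(batch.num_rows)
    break
```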
Can you help me grasp the typical I/O flow during dataset.map?