Difference between `.with_format("arrow")` and `.data.table`

tahayassine · November 24, 2024, 11:30pm

Hi,

What is the difference between accessing the underlying Arrow table of a dataset using .with_format("arrow") vs .data.table? And which one is recommended?

Example use case: computing max using pyarrow:

from datasets import load_dataset
import pyarrow.compute as pc

ds = load_dataset("mnist", split="train")

# Option 1
print(pc.max(ds.with_format("arrow")["label"]))

# Option 2
print(pc.max(ds.data.table["label"]))

Thanks!

lhoestq · November 30, 2024, 3:59pm

Hi! Both approaches are generally equivalent, except when the dataset has been shuffled (using .shuffle()) or when only certain indices are kept (using .train_test_split(), for example). In that case the original/unshuffled data is in ds.data, and there is an indices mapping in ds._indices that maps each row index of the shuffled dataset to the corresponding row index in the original data.

To summarize, ds.data contains the original/unshuffled data, while .with_format('arrow') lets you manipulate the dataset in Arrow format no matter what transformations were applied to it. So you should use .with_format('arrow') in general.

tahayassine · December 1, 2024, 12:51am

Makes sense, thanks!
