Difference between `.with_format("arrow")` and `.data.table`

Hi,

What is the difference between accessing the underlying Arrow table of a dataset using .with_format("arrow") vs .data.table? And which one is recommended?

Example use case: computing max using pyarrow:

from datasets import load_dataset
import pyarrow.compute as pc

ds = load_dataset("mnist", split="train")

# Option 1
print(pc.max(ds.with_format("arrow")["label"]))

# Option 2
print(pc.max(ds.data.table["label"]))

Thanks!

Hi! In general both approaches give the same result. They differ when the dataset has been shuffled (using .shuffle()) or when only certain rows are kept (using .train_test_split(), for example). In that case ds.data still holds the original, unshuffled data, and ds._indices holds an indices mapping from each row index of the shuffled dataset to the corresponding row index in the original data.

To summarize, ds.data contains the original, unshuffled data, while .with_format('arrow') lets you work with the dataset in Arrow format no matter which transformations were applied to it. So you should use .with_format('arrow') in general :)

Makes sense, thanks!

