What is the difference between accessing the underlying Arrow table of a dataset using .with_format("arrow") vs .data.table? And which one is recommended?
Example use case: computing max using pyarrow:
from datasets import load_dataset
import pyarrow.compute as pc
ds = load_dataset("mnist", split="train")
# Option 1
print(pc.max(ds.with_format("arrow")["label"]))
# Option 2
print(pc.max(ds.data.table["label"]))
Hi! The two approaches are generally equivalent, except when the dataset has been shuffled (using .shuffle()) or when only certain indices are kept (using .train_test_split(), for example). In those cases ds.data still holds the original/unshuffled data, and ds._indices holds an indices mapping from each row index of the shuffled dataset to the corresponding row index in the original data.
To summarize, ds.data contains the original/unshuffled data, while .with_format("arrow") lets you manipulate the dataset in Arrow format no matter what transformations were applied to it. So you should use .with_format("arrow") in general.