Datasets + Arrow Help

Hi, I’m just getting started and am excited that Datasets is built on Arrow. But I haven’t seen how to access the Arrow data. For example, how do I use pyarrow or Polars on loaded training data?

You should be able to access the underlying Arrow data through a dataset’s _data attribute. Note that such usage is not intended, though. EDIT: see @mariosasko’s reply. I was a bit too quick; there is also a public property, data, that you can use.

Hi! The underlying Arrow table can be accessed using the dset.data.table attribute, which can then be loaded in Polars as follows:

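import polars as pl
from datasets import load_dataset

# Note: without a split argument, load_dataset returns a DatasetDict; pick one split first.
dset = load_dataset(...)
# dset.data wraps the Arrow data; .table is the underlying pyarrow.Table.
df = pl.from_arrow(dset.data.table)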