Hi, I’m just getting started and am excited that Datasets is built on Arrow. But I haven’t seen how to access the Arrow data. For example, how do I use pyarrow or Polars on loaded training data?
You should be able to access the underlying Arrow data through a dataset’s _data attribute. Note that such usage is not intended, though. EDIT: see @mariosasko’s reply. I was a bit too quick: there is also a public property, data, that you can use.
Hi! The underlying Arrow table can be accessed using the dset.data.table attribute, which can then be loaded in Polars as follows:
import polars as pl
from datasets import load_dataset
dset = load_dataset(...)
df = pl.from_arrow(dset.data.table)