Hi @brando,
you can get the Parquet files for every config by clicking "Auto-converted to Parquet".
For example, for the hacker_news train split, it links to EleutherAI/pile at refs/convert/parquet.
Also note that if you click on "API", you have access to the REST API endpoints.
So you can download:
- the list of split names for each config (https://datasets-server.huggingface.co/splits?dataset=EleutherAI%2Fpile, or https://datasets-server.huggingface.co/splits?dataset=EleutherAI%2Fpile&config=hacker_news to get only the splits for the hacker_news config). As you can see, some configs have only 1 split, while others have up to 3.
- any range of rows for a given split, e.g. https://datasets-server.huggingface.co/rows?dataset=EleutherAI%2Fpile&config=hacker_news&split=train&offset=0&limit=100 gives you the first 100 rows of the hacker_news train split.
- the list of Parquet files I mentioned above, with https://huggingface.co/api/datasets/EleutherAI/pile/parquet/hacker_news/train
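
If you'd rather script this than click through, here is a minimal sketch using only the Python standard library. It rebuilds the three URLs above. The helper names (`splits_url`, `rows_url`, `parquet_files_url`, `fetch_json`) are made up for illustration and are not part of any Hugging Face library, and I'm not assuming anything about the shape of the JSON that comes back, so the example just prints it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URLs taken from the links above.
DATASETS_SERVER = "https://datasets-server.huggingface.co"
HUB_API = "https://huggingface.co/api/datasets"


def splits_url(dataset, config=None):
    """URL listing the splits of a dataset, optionally for a single config."""
    params = {"dataset": dataset}
    if config is not None:
        params["config"] = config
    # urlencode percent-encodes the "/" in "EleutherAI/pile" as %2F.
    return f"{DATASETS_SERVER}/splits?{urlencode(params)}"


def rows_url(dataset, config, split, offset=0, limit=100):
    """URL returning `limit` rows of a split, starting at `offset`."""
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "offset": offset,
        "limit": limit,
    }
    return f"{DATASETS_SERVER}/rows?{urlencode(params)}"


def parquet_files_url(dataset, config, split):
    """URL listing the auto-converted Parquet files for one config/split."""
    return f"{HUB_API}/{dataset}/parquet/{config}/{split}"


def fetch_json(url):
    """Download a URL and parse the body as JSON (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical usage: print whatever the splits endpoint returns.
    url = splits_url("EleutherAI/pile", config="hacker_news")
    print(url)
    print(json.dumps(fetch_json(url), indent=2))
```

Only the block under `if __name__ == "__main__":` touches the network; the URL builders are pure string functions, so you can swap in another dataset, config, or offset without changing anything else.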