Accessing the OSCAR dataset

saied · June 29, 2021, 6:43am

Hi guys,
I wanted to inspect the OSCAR dataset and create a subsample of it for an experiment. I followed the instructions in the documentation:

from datasets import load_dataset

dataset = load_dataset("oscar", "unshuffled_deduplicated_fa")

and it returns

DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 8203495
    })
})

My question is: how can I access the text data itself?
The directory that appears when I load the dataset is
/root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_fa/1.0.0/e4f06cecc7ae02f7adf85640b4019bf476d44453f251a1d84aebae28b0f8d51d

Thanks

thomwolf · July 6, 2021, 7:43am

dataset['train']['text'] for instance
