I have a big dataset, un_pc.
I want to load just part of it, like 200k rows to start. How do I do this?
I wonder if streaming would be a good option…
First, load the dataset from the Hugging Face Hub, then you can select how many rows you need.
For example:
from datasets import load_dataset

dataset_name = "Helsinki-NLP/un_pc"
dataset = load_dataset(dataset_name, split="train")
train_dataset = dataset.select(range(200000))  # keep only the first 200k rows
Note that the full dataset will still be downloaded to your machine, but only the selected 200k rows end up in train_dataset.
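If you'd rather not call .select() afterwards, load_dataset also understands split slicing, so you can ask for the first 200k rows directly. A minimal sketch (un_pc appears to be organized by language-pair configs, so "en-fr" here is an assumed config name; swap in your pair):

from datasets import load_dataset

# "train[:200000]" builds only the first 200k rows into the in-memory table,
# though the underlying files are still downloaded first
train_dataset = load_dataset("Helsinki-NLP/un_pc", "en-fr", split="train[:200000]")
print(train_dataset.num_rows)  # 200000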
You can load a subset of your dataset in Hugging Face using the load_dataset() function together with a few slicing options. Here are a few ways to do it:
Use .select() to Load a Specific Number of Rows
If your dataset is already loaded, you can select 200,000 rows like this:
from datasets import load_dataset
dataset = load_dataset("your_dataset_name")
subset = dataset["train"].select(range(200000)) # Select first 200k rows
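If you'd rather have a random 200k sample than the first 200k rows, here is a small sketch chaining the standard shuffle() and select() methods (the seed value is arbitrary):

subset = dataset["train"].shuffle(seed=42).select(range(200000))  # reproducible random 200k rows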
Use Streaming Mode for Large Datasets
If your dataset is too large to fit in memory, use streaming mode:
dataset = load_dataset("your_dataset_name", split="train", streaming=True)
subset = dataset.take(200000) # lazily yields only the first 200k rows
This prevents downloading the entire dataset at once.
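If you later want the streamed rows as a regular in-memory Dataset (e.g. for cached .map() calls), one way is sketched below, assuming the 200k rows fit in RAM and a recent datasets version that provides Dataset.from_list:

from datasets import Dataset

# take() is lazy, so the rows are actually fetched here, during iteration
materialized = Dataset.from_list(list(subset))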
Use data_files to Load a Specific File
If your dataset consists of multiple files, you can load only a specific portion:
dataset = load_dataset("your_dataset_name", data_files={"train": "train_part1.csv"})
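data_files also accepts lists and glob patterns, which helps when the data is sharded across many files. A quick sketch using the generic csv builder with an illustrative (hypothetical) file pattern:

from datasets import load_dataset

# load only the shards you need; "train_part*.csv" is an example pattern, not a real file
dataset = load_dataset("csv", data_files={"train": "train_part*.csv"})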
For more details, check out the Hugging Face documentation or community discussions. Let me know if you need help with a specific dataset!