I was wondering if there is a way to download only part of a dataset.
In my specific case, I need to download only X samples from the English split of OSCAR (X~100K samples).
When I try to invoke the dataset builder it asks for >1TB of space, so I think it downloads the full set of data up front.
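One possible approach is streaming, which reads examples lazily instead of downloading everything first. The sketch below is only an illustration: the helper names `first_n` and `stream_oscar_sample` are made up for this example, and the repo name "oscar" with config "unshuffled_deduplicated_en" is an assumption that may not load on every `datasets` version, so swap in whatever dataset and config you actually use.

```python
from itertools import islice


def first_n(examples, n):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(examples, n))


def stream_oscar_sample(n=100_000):
    # Imported lazily so first_n() above works even without `datasets` installed.
    from datasets import load_dataset

    # Assumption: this repo/config pair is loadable with your `datasets` version.
    stream = load_dataset(
        "oscar", "unshuffled_deduplicated_en", split="train", streaming=True
    )
    # Only the examples actually iterated over are fetched from the host.
    return first_n(stream, n)
```

Streamed datasets also expose `.take(n)`, which returns another lazy dataset rather than a list, if you would rather keep working with the `datasets` API.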
When I stream a dataset that's too big, it always seems to get stuck after a certain point (the program hangs and stops making progress), for example after 10,000 samples. After a long time it prints something like reconnect to data host. Is there a reason for that? Thanks!
It depends on the host. Some datasets are hosted on HF, while others have their data files hosted by the original dataset author or platform. You can see how a dataset is loaded by checking its repository on HF. Which dataset did you try to load?
Thanks for the answer, but what if I want to specify just the number of files I want without actually listing their names? Is that possible? Let's say I want to download 20 data files without listing their names in data_files.
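As far as I know, data_files has no "first N files" option, but one workaround is to list the repo's files yourself and pass the first N. This is a rough sketch, assuming the repo stores its data files directly on the Hub and all in one format; `pick_data_files` and `load_first_n_files` are hypothetical helper names, and the suffix list is a guess you should adapt to the repo.

```python
def pick_data_files(all_files, n, suffixes=(".parquet", ".jsonl", ".json", ".csv")):
    """Keep only data-like files, sort them for a stable order, return the first n."""
    data_files = sorted(f for f in all_files if f.endswith(tuple(suffixes)))
    return data_files[:n]


def load_first_n_files(repo_id, n=20):
    # Imported lazily so pick_data_files() can be used without these packages.
    from datasets import load_dataset
    from huggingface_hub import list_repo_files

    # List every file in the dataset repo, then keep the first n data files.
    all_files = list_repo_files(repo_id, repo_type="dataset")
    chosen = pick_data_files(all_files, n)
    # Only the chosen files are downloaded; they all land in the "train" split.
    return load_dataset(repo_id, data_files=chosen, split="train")
```

For example, `load_first_n_files("some-user/some-dataset", n=20)` (a placeholder repo id) would fetch only 20 files. If the repo mixes several file formats, narrow `suffixes` to a single one.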
Faced this issue as well so wrote a short script that pulls a hub dataset, creates a small sample of it, and pushes the sample data to the hub as a new dataset.
import os

import datasets


def create_sample_dataset(full_dataset_name, sample_count=100, username="my-username", cache_dir="./dataset"):
    # Create a directory to cache the downloaded dataset
    os.makedirs(cache_dir, exist_ok=True)

    # Build the sample's repo id from the source dataset name
    dataset_name = full_dataset_name.split("/")[-1]
    dataset_name_sample = f"{dataset_name}-sample-{sample_count}"
    repo_id = f"{username}/{dataset_name_sample}"

    # Load the dataset (note: this still downloads the full dataset first)
    dataset = datasets.load_dataset(full_dataset_name, cache_dir=cache_dir)

    # Shuffle each split and keep sample_count rows (modify for other splits)
    train_sample = dataset["train"].shuffle(seed=42).select(range(sample_count))
    test_sample = dataset["test"].shuffle(seed=42).select(range(sample_count))

    # Push both splits to the hub
    train_sample.push_to_hub(repo_id, split="train")
    print("INFO: Train split pushed to the hub successfully")
    test_sample.push_to_hub(repo_id, split="test")
    print("INFO: Test split pushed to the hub successfully")