Some datasets are huge, which makes it impractical to load them entirely from the Hugging Face Hub with load_dataset() when debugging code. In that situation you only need part of the dataset, say the first 10k rows. But how?
I know it is possible to load part of a dataset into memory with "slice splitting", but it appears that this still downloads the whole dataset first if it is not already cached.
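For reference, this is roughly what the slice syntax looks like (the dataset name is just a placeholder); note that load_dataset() still downloads the full dataset files to the cache before applying the slice:

from datasets import load_dataset

# Only the first 10k rows of the train split are returned,
# but the full dataset is downloaded to the local cache first.
subset = load_dataset("some-org/some-large-dataset", split="train[:10000]")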
I faced this issue as well, so I wrote a short script that pulls a Hub dataset, creates a small sample of it, and pushes the sample to the Hub as a new dataset.
import os
import datasets

def create_sample_dataset(full_dataset_name, sample_count=100, username="my-username", cache_dir="./dataset"):
    # Create a local cache directory for the full dataset
    os.makedirs(cache_dir, exist_ok=True)

    # Build the name of the sampled dataset, e.g. "my-data-sample-100"
    dataset_name = full_dataset_name.split("/")[-1]
    dataset_name_sample = f"{dataset_name}-sample-{sample_count}"

    # Load the full dataset (this still downloads everything once)
    dataset = datasets.load_dataset(full_dataset_name, cache_dir=cache_dir)

    # Sample `sample_count` rows from the train and test splits (adjust for other splits)
    train_sample = dataset["train"].shuffle(seed=42).select(range(sample_count))
    test_sample = dataset["test"].shuffle(seed=42).select(range(sample_count))

    # Push both samples to the Hub under your namespace
    train_sample.push_to_hub(f"{username}/{dataset_name_sample}", split="train")
    print("INFO: Train split pushed to the hub successfully")
    test_sample.push_to_hub(f"{username}/{dataset_name_sample}", split="test")
    print("INFO: Test split pushed to the hub successfully")
Once the sample is pushed, you have a smaller version of your dataset on the Hub to pull from.
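For example (the dataset and username below are placeholders), you would run the function once and then load the small copy directly in your debugging code:

from datasets import load_dataset

# One-off: create and push the sampled dataset
create_sample_dataset("some-org/big-dataset", sample_count=100, username="my-username")

# Afterwards, load only the small sample instead of the full dataset
sample = load_dataset("my-username/big-dataset-sample-100")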