How do I structure this?

Hello. I’m fairly new to Hugging Face `datasets`, and I was hoping to get some help with what I’m trying to do. I have a repo that looks something like this:

β”œβ”€β”€ README.md
β”œβ”€β”€ originals
β”‚   β”œβ”€β”€ thumbnails
β”‚   β”‚   └── user1/
β”‚   β”‚       └── <files>
β”‚   β”œβ”€β”€ timelapses
β”‚   β”‚   β”œβ”€β”€ user1/
β”‚   β”‚   β”‚   └── <files>
β”‚   β”‚   β”œβ”€β”€ user2/
β”‚   β”‚   β”‚   └── <files>
β”‚   β”‚   └── user3/
β”‚   β”‚       └── <files>
β”‚   └── videos
β”‚       └── user1/
β”‚           └── <files>
└── metadata
    β”œβ”€β”€ thumbnails_metadata.csv
    β”œβ”€β”€ timelapses_metadata.csv
    └── videos_metadata.csv

I’m trying to make it so I can access each of these datasets independently, along with its metadata. I’ve been reading the docs, but nothing I’ve tried has worked so far.

Here’s what I have for trying to access the metadata:

from datasets import DatasetBuilder, Features, Value, DatasetInfo, SplitGenerator
import csv


class VideosDataset(DatasetBuilder):
    VERSION = "0.0.0"

    def _info(self):
        return DatasetInfo(
            features=Features(
                {
                    "time": Value("string"),
                    "user": Value("string"),
                    "path": Value("string"),
                    "duration": Value("float32"),
                    "num_frames": Value("int32"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        return [
            SplitGenerator(
                name="default",
                gen_kwargs={"filepath": "metadata/videos.csv"},
            )
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for idx, row in enumerate(reader):
                yield idx, {
                    "time": row["time"],
                    "user": row["user"],
                    "path": row["path"],
                    "duration": float(row["duration"]),
                    "num_frames": int(row["num_frames"]),
                }

With this, I get the following error:

>>> VideosDataset().as_dataset()

File /opt/miniconda3/envs/BambuAPIAndMore/lib/python3.12/site-packages/datasets/builder.py:1117, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, in_memory)
   1110 if not os.path.exists(self._output_dir):
   1111     raise FileNotFoundError(
   1112         f"Dataset {self.dataset_name}: could not find data in {self._output_dir}. Please make sure to call "
   1113         "builder.download_and_prepare(), or use "
   1114         "datasets.load_dataset() before trying to access the Dataset object."
   1115     )
-> 1117 logger.debug(f'Constructing Dataset for split {split or ", ".join(self.info.splits)}, from {self._output_dir}')
   1119 # By default, return all splits
   1120 if split is None:

TypeError: can only join an iterable
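If I’m reading the traceback right, `self.info.splits` is still `None` because the dataset was never built (I never called `download_and_prepare()`, and `DatasetBuilder` itself is abstract β€” my guess is a `GeneratorBasedBuilder` subclass is what actually wires `_generate_examples` into the prepare step). The `TypeError` itself is just `str.join` being handed `None`, which can be reproduced without `datasets` at all:

```python
# Minimal reproduction of the TypeError in the traceback: on an unprepared
# builder, info.splits is still None, and ", ".join(None) fails with
# exactly this message. `splits` here stands in for self.info.splits.
splits = None

try:
    ", ".join(splits)
except TypeError as exc:
    print(exc)  # -> can only join an iterable
```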

I’ve been able to access the video data by using the following:

>>> from datasets import load_dataset, Video


>>> ds = load_dataset("originals/videos/Borillion/", trust_remote_code=True).cast_column(
...     "video", Video(decode=False)
... )

I’m unsure whether this would work when loading the dataset remotely, and I don’t think I’m doing this correctly.

My repo can be found here.

It might be easier to use the conventional method.

Omg, I feel silly for having missed this. Going to test things out and (hopefully) close this if it resolves my problems. Thanks!
