How do I structure this?

Hello. I’m fairly new to Hugging Face `datasets`, and I was hoping to get some help with what I’m trying to do. I have a repo that looks something like this:

β”œβ”€β”€ README.md
β”œβ”€β”€ originals
β”‚   β”œβ”€β”€ thumbnails
β”‚   β”‚   └── user1/
β”‚   β”‚       └── <files>
β”‚   β”œβ”€β”€ timelapses
β”‚   β”‚   β”œβ”€β”€ user1/
β”‚   β”‚   β”‚   └── <files>
β”‚   β”‚   β”œβ”€β”€ user2/
β”‚   β”‚   β”‚   └── <files>
β”‚   β”‚   └── user3/
β”‚   β”‚       └── <files>
β”‚   └── videos
β”‚       └── user1/
β”‚           └── <files>
└── metadata
    β”œβ”€β”€ thumbnails_metadata.csv
    β”œβ”€β”€ timelapses_metadata.csv
    └── videos_metadata.csv

I’m trying to make it so I can access each of these datasets independently, along with its metadata. I’ve been reading the docs, but nothing I’ve tried has worked so far.

Here’s what I have for trying to access the metadata:

from datasets import DatasetBuilder, Features, Value, DatasetInfo, SplitGenerator
import csv


class VideosDataset(DatasetBuilder):
    VERSION = "0.0.0"

    def _info(self):
        return DatasetInfo(
            features=Features(
                {
                    "time": Value("string"),
                    "user": Value("string"),
                    "path": Value("string"),
                    "duration": Value("float32"),
                    "num_frames": Value("int32"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        return [
            SplitGenerator(
                name="default",
                gen_kwargs={"filepath": "metadata/videos.csv"},
            )
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for idx, row in enumerate(reader):
                yield idx, {
                    "time": row["time"],
                    "user": row["user"],
                    "path": row["path"],
                    "duration": float(row["duration"]),
                    "num_frames": int(row["num_frames"]),
                }

With this, I get the following error:

>>> VideosDataset().as_dataset()

File /opt/miniconda3/envs/BambuAPIAndMore/lib/python3.12/site-packages/datasets/builder.py:1117, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, in_memory)
   1110 if not os.path.exists(self._output_dir):
   1111     raise FileNotFoundError(
   1112         f"Dataset {self.dataset_name}: could not find data in {self._output_dir}. Please make sure to call "
   1113         "builder.download_and_prepare(), or use "
   1114         "datasets.load_dataset() before trying to access the Dataset object."
   1115     )
-> 1117 logger.debug(f'Constructing Dataset for split {split or ", ".join(self.info.splits)}, from {self._output_dir}')
   1119 # By default, return all splits
   1120 if split is None:

TypeError: can only join an iterable
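If I’m reading the traceback right, `self.info.splits` is still `None` because the dataset was never built (I never called `download_and_prepare()`, and `DatasetBuilder` itself is abstract β€” my guess is a `GeneratorBasedBuilder` subclass is what actually wires `_generate_examples` into the prepare step). The `TypeError` itself is just `str.join` being handed `None`, which can be reproduced without `datasets` at all:

```python
# Minimal reproduction of the TypeError in the traceback: on an unprepared
# builder, info.splits is still None, and ", ".join(None) fails with
# exactly this message. `splits` here stands in for self.info.splits.
splits = None

try:
    ", ".join(splits)
except TypeError as exc:
    print(exc)  # -> can only join an iterable
```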

I’ve been able to access the video data by using the following:

>>> from datasets import load_dataset, Video


>>> ds = load_dataset("originals/videos/Borillion/", trust_remote_code=True).cast_column(
...     "video", Video(decode=False)
... )

I’m unsure whether this would work when loading the dataset remotely, and I don’t think I’m doing this correctly.

My repo can be found here.

It might be easier to use the conventional method.

Omg, I feel silly for having missed this. Going to test things out and (hopefully) close this if it resolves my problems. Thanks!
