Hello. I'm fairly new to Hugging Face datasets, and I was hoping I could get some help with what I'm trying to do. I have a repo that looks something like this:
├── README.md
├── originals
│   ├── thumbnails
│   │   └── user1/
│   │       └── <files>
│   ├── timelapses
│   │   ├── user1/
│   │   │   └── <files>
│   │   ├── user2/
│   │   │   └── <files>
│   │   └── user3/
│   │       └── <files>
│   └── videos
│       └── user1/
│           └── <files>
└── metadata
    ├── thumbnails_metadata.csv
    ├── timelapses_metadata.csv
    └── videos_metadata.csv
I'm trying to make it so I can access each of these datasets independently, along with its metadata. I've been reading the docs, but nothing I've tried has worked.
Here's what I have so far for accessing the metadata:
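To make the goal concrete, here is a rough sketch of the mapping I'm after: each subset name should resolve to its media folder and its metadata CSV. The helper name `subset_paths` is something I made up for illustration, and the paths simply follow the tree above; it is not part of the `datasets` library.

```python
from pathlib import Path

# Subset names taken from the directory tree above.
SUBSETS = ("thumbnails", "timelapses", "videos")


def subset_paths(repo_root, subset):
    """Return (media_dir, metadata_csv) for one subset (hypothetical helper)."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    root = Path(repo_root)
    media_dir = root / "originals" / subset
    metadata_csv = root / "metadata" / f"{subset}_metadata.csv"
    return media_dir, metadata_csv


media_dir, metadata_csv = subset_paths(".", "videos")
print(media_dir.as_posix())     # originals/videos
print(metadata_csv.as_posix())  # metadata/videos_metadata.csv
```

Ideally each of the three subsets would be loadable on its own this way, without pulling in the other two.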
from datasets import DatasetBuilder, Features, Value, DatasetInfo, SplitGenerator
import csv


class VideosDataset(DatasetBuilder):
    VERSION = "0.0.0"

    def _info(self):
        return DatasetInfo(
            features=Features(
                {
                    "time": Value("string"),
                    "user": Value("string"),
                    "path": Value("string"),
                    "duration": Value("float32"),
                    "num_frames": Value("int32"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        return [
            SplitGenerator(
                name="default",
                gen_kwargs={"filepath": "metadata/videos.csv"},
            )
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for idx, row in enumerate(reader):
                yield idx, {
                    "time": row["time"],
                    "user": row["user"],
                    "path": row["path"],
                    "duration": float(row["duration"]),
                    "num_frames": int(row["num_frames"]),
                }
With this, I get the following error:
>>> VideosDataset().as_dataset()
File /opt/miniconda3/envs/BambuAPIAndMore/lib/python3.12/site-packages/datasets/builder.py:1117, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, in_memory)
1110 if not os.path.exists(self._output_dir):
1111 raise FileNotFoundError(
1112 f"Dataset {self.dataset_name}: could not find data in {self._output_dir}. Please make sure to call "
1113 "builder.download_and_prepare(), or use "
1114 "datasets.load_dataset() before trying to access the Dataset object."
1115 )
-> 1117 logger.debug(f'Constructing Dataset for split {split or ", ".join(self.info.splits)}, from {self._output_dir}')
1119 # By default, return all splits
1120 if split is None:
TypeError: can only join an iterable
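If I'm reading the traceback correctly, `split` is `None` and `self.info.splits` is also `None` at line 1117 (my guess is that nothing has been prepared yet), so the `join` call fails. The snippet below reproduces just that message in plain Python. The `split` and `splits` variables are stand-ins I assumed from the f-string in the traceback, not the library's actual objects.

```python
# Stand-ins for the values used in the f-string on line 1117 of builder.py.
# These are assumptions based on the traceback, not inspected library state.
split = None
splits = None


def describe(split, splits):
    # Same expression shape as in the traceback: split or ", ".join(...)
    return split or ", ".join(splits)


try:
    describe(split, splits)
except TypeError as err:
    message = str(err)
    print(message)  # can only join an iterable
```

If that reading is right, the error is about the builder having no split information yet, rather than about the CSV parsing itself.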
I've been able to access the video data by using the following:
>>> from datasets import load_dataset, Video
>>> ds = load_dataset("originals/videos/Borillion/", trust_remote_code=True).cast_column(
...     "video", Video(decode=False)
... )
I'm unsure whether this would work when loading the dataset remotely, and I don't think I'm doing this correctly.
My repo can be found here.