.gz supported or not supported?

Hi,
We work with a dataset in .gz format where each sample is a dict with various fields (meshes, arrays, etc.). On the website it is mentioned that .gz is supported.

The Hub natively supports multiple file formats:

    CSV (.csv, .tsv)
    JSON Lines, JSON (.jsonl, .json)
    Parquet (.parquet)
    Arrow streaming format (.arrow)
    Text (.txt)
    Images (.png, .jpg, etc.)
    Audio (.wav, .mp3, etc.)
    WebDataset (.tar)

It supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz).

But after successfully creating the dataset repository, I couldn't load the dataset (I tried the full dataset and a single file) and received this error:

FileNotFoundError: Unable to find 'hf://datasets/varora/HIT@bb206cd44dcc34f859bb09b547b1fee6898954a7/male/train/0252.gz' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.parquet', '.geoparquet', '.gpq', '.arrow', '.txt', '.tar', '.blp', '.bmp', '.dib', '.bufr', '.cur', '.pcx', '.dcx', '.dds', '.ps', '.eps', '.fit', '.fits', '.fli', '.flc', '.ftc', '.ftu', '.gbr', '.gif', '.grib', '.h5', '.hdf', '.png', '.apng', '.jp2', '.j2k', '.jpc', '.jpf', '.jpx', '.j2c', '.icns', '.ico', '.im', '.iim', '.tif', '.tiff', '.jfif', '.jpe', '.jpg', '.jpeg', '.mpg', '.mpeg', '.msp', '.pcd', '.pxr', '.pbm', '.pgm', '.ppm', '.pnm', '.psd', '.bw', '.rgb', '.rgba', '.sgi', '.ras', '.tga', '.icb', '.vda', '.vst', '.webp', '.wmf', '.emf', '.xbm', '.xpm', '.BLP', '.BMP', '.DIB', '.BUFR', '.CUR', '.PCX', '.DCX', '.DDS', '.PS', '.EPS', '.FIT', '.FITS', '.FLI', '.FLC', '.FTC', '.FTU', '.GBR', '.GIF', '.GRIB', '.H5', '.HDF', '.PNG', '.APNG', '.JP2', '.J2K', '.JPC', '.JPF', '.JPX', '.J2C', '.ICNS', '.ICO', '.IM', '.IIM', '.TIF', '.TIFF', '.JFIF', '.JPE', '.JPG', '.JPEG', '.MPG', '.MPEG', '.MSP', '.PCD', '.PXR', '.PBM', '.PGM', '.PPM', '.PNM', '.PSD', '.BW', '.RGB', '.RGBA', '.SGI', '.RAS', '.TGA', '.ICB', '.VDA', '.VST', '.WEBP', '.WMF', '.EMF', '.XBM', '.XPM', '.aiff', '.au', '.avr', '.caf', '.flac', '.htk', '.svx', '.mat4', '.mat5', '.mpc2k', '.ogg', '.paf', '.pvf', '.raw', '.rf64', '.sd2', '.sds', '.ircam', '.voc', '.w64', '.wav', '.nist', '.wavex', '.wve', '.xi', '.mp3', '.opus', '.AIFF', '.AU', '.AVR', '.CAF', '.FLAC', '.HTK', '.SVX', '.MAT4', '.MAT5', '.MPC2K', '.OGG', '.PAF', '.PVF', '.RAW', '.RF64', '.SD2', '.SDS', '.IRCAM', '.VOC', '.W64', '.WAV', '.NIST', '.WAVEX', '.WVE', '.XI', '.MP3', '.OPUS', '.zip']

In this error message, .gz is missing from the list of valid file extensions.
I know my dataset repo is fine since I could load a single …png image in another folder successfully.

To reproduce:

dataset = load_dataset("varora/HIT")

If there is indeed a mistake in the docs, then why isn't this format supported? It's quite common in the medical community.

The docs are not super clear. I understand I need to create a custom data loading script. Is this the way to go?

# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# TODO: Address all TODOs and remove all explanatory comments
"""TODO: Add a description here."""


import csv
import gzip  # needed for gzip.open in _generate_examples
import json
import os
import pickle
from glob import glob

import datasets


# TODO: Add BibTeX citation
# Find for instance the citation on arxiv or on the dataset repo/website
_CITATION = """\
@inproceedings{Keller:CVPR:2024,
  title = {{HIT}: Estimating Internal Human Implicit Tissues from the Body Surface},
  author = {Keller, Marilyn and Arora, Vaibhav and Dakri, Abdelmouttaleb and Chandhok, Shivam and 
  Machann, Jürgen and Fritsche, Andreas and Black, Michael J. and Pujades, Sergi},   
  booktitle = {Proceedings IEEE/CVF Conf.~on Computer Vision and Pattern Recognition (CVPR)},
  month = jun,
  year = {2024},
  month_numeric = {6}}
"""

# TODO: Add description of the dataset here
# You can copy an official description
_DESCRIPTION = """\
The HIT dataset is a structured dataset of paired observations of body's inner tissues and the body surface. More concretely, it is a dataset of paired full-body volumetric segmented (bones, lean, and adipose tissue) MRI scans and SMPL meshes capturing the body surface shape for male (N=157) and female (N=241) subjects respectively. This is relevant for medicine, sports science, biomechanics, and computer graphics as it can ease the creation of personalized anatomic digital twins that model our bones, lean, and adipose tissue."""

# TODO: Add a link to an official homepage for the dataset here
_HOMEPAGE = "https://hit.is.tue.mpg.de/"

# TODO: Add the licence for the dataset here if you can find it
_LICENSE = "see https://huggingface.co/datasets/varora/HIT/blob/main/README.md"

# TODO: Add link to the official dataset URLs here
# The HuggingFace Datasets library doesn't host the datasets but only points to the original files.
# This can be an arbitrary nested dict/list of URLs (see below in `_split_generators` method)
_PATHS = {
    "male": "/male",
    "female": "/female",
}

# TODO: Name of the dataset usually matches the script name with CamelCase instead of snake_case
class NewDataset(datasets.GeneratorBasedBuilder):
    """TODO: Short description of my dataset."""

    VERSION = datasets.Version("1.1.0")

    # This is an example of a dataset with multiple configurations.
    # If you don't want/need to define several sub-sets in your dataset,
    # just remove the BUILDER_CONFIG_CLASS and the BUILDER_CONFIGS attributes.

    # If you need to make complex sub-parts in the datasets with configurable options
    # You can create your own builder configuration class to store attribute, inheriting from datasets.BuilderConfig
    # BUILDER_CONFIG_CLASS = MyBuilderConfig

    # You will be able to load one or the other configurations in the following list with
    # data = datasets.load_dataset('my_dataset', 'first_domain')
    # data = datasets.load_dataset('my_dataset', 'second_domain')

    def _info(self):
        print("HELOOOOOOOOO")
        features = datasets.Features(
            {
                "gender": datasets.Value("string"),
                "mri_seg": datasets.Value("int64"),
                "mri_labels": datasets.Sequence(datasets.Sequence(datasets.Value("int64"))),
                "mri_seg_dict": datasets.Sequence(datasets.Sequence(datasets.Value("float"))),
                "resolution": datasets.Value("double"),
                "center": datasets.Value("double"),
                "smpl_dict": datasets.Sequence(datasets.Sequence(datasets.Value("double"))),
                "dataset_name": datasets.Value("string"),
                "subject_ID": datasets.Value("string")
                # These are the features of your dataset like images, labels ...
            }
        )
        return datasets.DatasetInfo(
            # This is the description that will appear on the datasets page.
            description=_DESCRIPTION,
            # This defines the different columns of the dataset and their types
            features=features,  # Here we define them above because they are different between the two configurations
            # If there's a common (input, target) tuple from the features, uncomment supervised_keys line below and
            # specify them. They'll be used if as_supervised=True in builder.as_dataset.
            # supervised_keys=("sentence", "label"),
            # Homepage of the dataset for documentation
            homepage=_HOMEPAGE,
            # License for the dataset if available
            license=_LICENSE,
            # Citation for the dataset
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        # TODO: This method is tasked with downloading/extracting the data and defining the splits depending on the configuration
        # If several configurations are possible (listed in BUILDER_CONFIGS), the configuration selected by the user is in self.config.name
        # dl_manager is a datasets.download.DownloadManager that can be used to download and extract URLS
        # It can accept any type or nested list/dict and will give back the same structure with the url replaced with path to local files.
        # By default the archives will be extracted and a path to a cached folder where they are extracted is returned instead of the archive
        rel_path = _PATHS[self.config.name]
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": os.path.join(rel_path, "train"),
                    "split": "train",
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": os.path.join(rel_path, "val"),
                    "split": "validation",
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": os.path.join(rel_path, "test"),
                    "split": "test"
                },
            ),
        ]

    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
    def _generate_examples(self, filepath, split):
        # TODO: This method handles input defined in _split_generators to yield (key, example) tuples from the dataset.
        # The `key` is for legacy reasons (tfds) and is not important in itself, but must be unique for each example.
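        # Collect the full path of every .gz file under `filepath`
        file_paths = []
        for root, dirs, files in os.walk(filepath):
            for file in files:
                if file.endswith('.gz'):
                    # os.walk yields bare file names, so join them with their directory
                    file_paths.append(os.path.join(root, file))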
        for subject_path in file_paths:
            with gzip.open(subject_path, 'rb') as f:
                data = pickle.load(f)
            key = data['subject_ID']
            yield key, data

But this gives me an error that is very difficult to trace back to the custom loader script:

Traceback (most recent call last):
  File "/snap/pycharm-professional/378/plugins/python/helpers/pydev/pydevd.py", line 1534, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/snap/pycharm-professional/378/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/varora/PythonProjects/hf_hit/HIT/tmp.py", line 2, in <module>
    male_dataset = load_dataset("varora/hit", "male", split="test", verification_mode="no_checks")
  File "/home/varora/anaconda3/envs/llm/lib/python3.10/site-packages/datasets/load.py", line 2595, in load_dataset
    ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
  File "/home/varora/anaconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py", line 1244, in as_dataset
    datasets = map_nested(
  File "/home/varora/anaconda3/envs/llm/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 457, in map_nested
    return function(data_struct)
  File "/home/varora/anaconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py", line 1274, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/varora/anaconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py", line 1348, in _as_dataset
    dataset_kwargs = ArrowReader(cache_dir, self.info).read(
  File "/home/varora/anaconda3/envs/llm/lib/python3.10/site-packages/datasets/arrow_reader.py", line 254, in read
    raise ValueError(msg)
ValueError: Instruction "test" corresponds to no data!

Hi! Your file names need to carry the format extension first and then the compression extension, for example .jsonl.gz.
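
To make the naming rule concrete, here is a minimal sketch that splits a file name into its format extension and its compression extension. The helper `describe_name` and the two extension sets are hypothetical, written only for illustration from the documentation excerpt quoted in the question (image and audio extensions are left out); this is not how the library itself resolves files.

```python
from pathlib import PurePosixPath

# Extensions copied from the documentation excerpt quoted in the question
# (image and audio extensions omitted for brevity).
FORMAT_EXTENSIONS = {".csv", ".tsv", ".jsonl", ".json", ".parquet", ".arrow", ".txt", ".tar"}
COMPRESSION_EXTENSIONS = {".gz", ".zst", ".bz2", ".lz4", ".xz"}


def describe_name(filename):
    """Return (format_ext, compression_ext); either one may be None."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    compression = None
    # A trailing compression suffix is peeled off first...
    if suffixes and suffixes[-1] in COMPRESSION_EXTENSIONS:
        compression = suffixes.pop()
    # ...and whatever suffix remains last must name the actual data format.
    fmt = suffixes[-1] if suffixes and suffixes[-1] in FORMAT_EXTENSIONS else None
    return fmt, compression


print(describe_name("male/train/0252.gz"))        # no format extension found
print(describe_name("male/train/0252.jsonl.gz"))  # format plus compression
```

A name like 0252.gz only says how the bytes are compressed, not what they contain, which is probably why the loader cannot pick a reader for it.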

However, it looks like your files are Pickle files, and Pickle is not a supported format on HF.
Indeed, Pickle is an unsafe format: it’s trivial to embed malware in a Pickle file and use it to attack anyone who loads it.
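
To illustrate why loading a Pickle file means trusting its author, here is a small, harmless sketch. The class `Payload` is hypothetical and only calls `os.getcwd`; the point is that `pickle.loads` invokes whatever callable the file names, so a malicious file could name something destructive instead.

```python
import os
import pickle


class Payload:
    """Hypothetical class whose pickled form calls a function when loaded."""

    def __reduce__(self):
        # Instead of the object's state, pickle stores "call os.getcwd()".
        # An attacker would put a harmful callable and arguments here.
        return (os.getcwd, ())


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # os.getcwd() runs while the bytes are being loaded

# The loaded value is whatever the callable returned, not a Payload instance.
print(type(result).__name__)
```

Nothing in the bytes warns the reader before the call happens, which is why the Python documentation advises unpickling only data you trust.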

I’d suggest using another file format, such as Parquet, which supports all the types, including arrays.
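
As a rough sketch of such a conversion, the code below rewrites gzip-compressed Pickle samples into a single gzip-compressed JSON Lines file, which also satisfies the .jsonl.gz naming mentioned above. It deliberately uses JSON Lines rather than Parquet so that it runs with the standard library alone; for large volumetric arrays, Parquet is likely the better target. The function names `to_plain` and `convert_pickles_to_jsonl_gz` are hypothetical, and the code assumes that every value in a sample is a scalar, a string, a list/tuple/dict, or an object with a `tolist()` method (as NumPy arrays have).

```python
import gzip
import json
import pickle
from pathlib import Path


def to_plain(value):
    """Recursively turn array-like values into plain lists, dicts, and scalars."""
    if hasattr(value, "tolist"):  # e.g. NumPy arrays and NumPy scalars
        return value.tolist()
    if isinstance(value, dict):
        return {str(k): to_plain(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(v) for v in value]
    return value


def convert_pickles_to_jsonl_gz(src_dir, dst_file):
    """Write one JSON line per gzip-pickled sample found under src_dir."""
    dst_file = Path(dst_file)
    # List the inputs before creating the output, and never read the output back.
    paths = [p for p in sorted(Path(src_dir).rglob("*.gz"))
             if p.resolve() != dst_file.resolve()]
    with gzip.open(dst_file, "wt", encoding="utf-8") as out:
        for path in paths:
            # Only unpickle files you created yourself: loading can run code.
            with gzip.open(path, "rb") as f:
                sample = pickle.load(f)
            out.write(json.dumps(to_plain(sample)) + "\n")
    return len(paths)
```

Calling something like `convert_pickles_to_jsonl_gz("male/train", "male_train.jsonl.gz")` once per split would produce files whose names carry both a format and a compression extension; the paths here are placeholders, not the repository's actual layout.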