How to tweak a dataset without a loading script?

I have a dataset that does not have a loading script. I’d like to make some small modifications at load time, basically making a call to ds.map.

How can I create a dataset script that invokes the auto loader?

I’d like to do something like class MyLoader(AutoLoader) and then override generate_examples.

This requires overriding DatasetBuilder’s _post_process method, which requires creating a loading script. Alternatively, you can explain in the README what modifications need to be applied before using the dataset (e.g., a snippet that loads the dataset and runs map on it).
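For instance, the README could show something along these lines (the repo id and column name are just placeholders):

from datasets import load_dataset

ds = load_dataset("org/dataset-without-script")  # placeholder repo id
ds = ds.map(lambda example: {"text": example["text"].lower()})  # placeholder tweak on a "text" column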

@mariosasko I tried inheriting from DatasetBuilder, but I got an error about _info being undefined. I guess I was wondering if there’s a way to easily override the _post_process method of the “default builder”.

@mariosasko

I’ve been looking at this, and it seems like there really isn’t an easy way.

A lot of the scriptless magic happens in HubDatasetModuleFactoryWithoutScript and LocalDatasetModuleFactoryWithoutScript. But because this happens outside of the builder, it’s difficult to incorporate into a loading script. As an example, let’s say I want to enumerate the data files in my dataset repository. I could use glob for this, which would work locally. Or I could use a function to list the files in the repo, which would work on the Hub. Either way is pretty awkward, and there’s a big gap between the fully automatic, script-less path and even lightly tweaking it.
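To illustrate (not what I ended up using; the repo id is a placeholder), the two approaches would look roughly like this:

import glob

from huggingface_hub import list_repo_files

# Locally: enumerate the Parquet files next to the script.
local_files = glob.glob("data/*.parquet")

# On the Hub: list the files in the dataset repository instead (placeholder repo id).
hub_files = [
    f for f in list_repo_files("org/my-dataset", repo_type="dataset")
    if f.endswith(".parquet")
]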

An exacerbating factor is that most datasets with a loading script don’t host the data on the Hugging Face Hub. It took me a while to find a simple example of a dataset that has both: sts17-crosslingual-sts.py · mteb/sts17-crosslingual-sts at main. Linking to this or a similar simple script somewhere in the documentation would be helpful.

Copying from the Parquet builder, I wound up with this:

#!/usr/bin/python

import datasets

import pyarrow as pa
import pyarrow.parquet as pq

# Logger used by the (commented-out) debugging calls below.
logger = datasets.utils.logging.get_logger(__name__)

_DATA_FILES = ['data/combined-00009-of-00013-97a88bccf4215954.parquet',
 'data/combined-00004-of-00013-119d653561443d7b.parquet',
 'data/combined-00007-of-00013-ab54cce4ee6331d0.parquet',
 'data/combined-00002-of-00013-149f5d0d22fe8f52.parquet',
 'data/combined-00003-of-00013-426af6f6064e67dd.parquet',
 'data/combined-00010-of-00013-89d7565c5f0d2e4e.parquet',
 'data/combined-00000-of-00013-36d239509fb9e430.parquet',
 'data/combined-00005-of-00013-363bba92a2b7f737.parquet',
 'data/combined-00006-of-00013-4d4d574c9d87176e.parquet',
 'data/combined-00001-of-00013-d5b44e96ad7d2927.parquet',
 'data/combined-00012-of-00013-84cf41ef75dd5b76.parquet',
 'data/combined-00011-of-00013-4c21766cedd5a4a0.parquet',
 'data/combined-00008-of-00013-674f74b6f2288c61.parquet']

class OOMethodTestDataset(datasets.ArrowBasedBuilder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def _info(self):
        # No explicit features: they are inferred from the Parquet schema.
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        # Relative paths: the download manager resolves them against the dataset repository.
        files = _DATA_FILES
        downloaded_files = dl_manager.download(files)

        #print(files)
        #print(downloaded_files)

        return [
            datasets.SplitGenerator(
                name="combined",
                gen_kwargs={
                    "files": downloaded_files,
                },
            ),
        ]
    
    def _generate_tables(self, files):
        for file_idx, file in enumerate(files):
            with open(file, "rb") as f:
                parquet_file = pq.ParquetFile(f)
                try:
                    for batch_idx, record_batch in enumerate(
                        parquet_file.iter_batches(batch_size=10_000)
                    ):
                        pa_table = pa.Table.from_batches([record_batch])
                        # Uncomment for debugging (will print the Arrow table size and elements)
                        # logger.warning(f"pa_table: {pa_table} num rows: {pa_table.num_rows}")
                        # logger.warning('\n'.join(str(pa_table.slice(i, 1).to_pydict()) for i in range(pa_table.num_rows)))
                        yield f"{file_idx}_{batch_idx}", pa_table
                except ValueError as e:
                    #logger.error(f"Failed to read file '{file}' with error {type(e)}: {e}")
                    raise

I tried to inherit from the Parquet builder, but it was like swimming upstream.

It really doesn’t seem like it should be this hard to go from no loading script to a loading script…

Another solution is to make a new dataset without a loading script, and load that dataset from a dataset with a loading script. For example:
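A rough sketch of what I mean (the repo id and column name are placeholders, not the exact script):

import datasets

class TweakedDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        # Load the script-less dataset, then tweak it with map (placeholder repo id and column).
        inner = datasets.load_dataset("org/dataset-without-script")
        inner = inner.map(lambda example: {"text": example["text"].lower()})
        return [
            datasets.SplitGenerator(name=split, gen_kwargs={"ds": inner[split]})
            for split in inner
        ]

    def _generate_examples(self, ds):
        for idx, example in enumerate(ds):
            yield idx, example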

map/filter require a dataset to be constructed, so I wouldn’t call using them inside a dataset script (before building the dataset) “tweaking”. We can consider making specifying (and sharing) post-processing methods easier if we get more requests like this.

I could use glob for this, which would work locally. Or I could use a function to list the files in the repo, which would work on the Hub.

Globbing will become consistent once we start using huggingface_hub.HfFileSystem in datasets (very soon).
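Once that lands, something like this should work the same way locally and against the Hub (placeholder repo id):

from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# Hub paths look like "datasets/<repo_id>/<path>".
files = fs.glob("datasets/org/my-dataset/data/*.parquet")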

I tried to inherit from the Parquet builder, but it was like swimming upstream.

It really doesn’t seem like it should be this hard to go from no loading script to a loading script…

Making this simple is the goal of https://github.com/huggingface/datasets/pull/5331.

Another solution is to make a new dataset without a loading script, and load that dataset from a dataset with a loading script.

The only issue is that this caches the dataset twice (should be OK for smaller datasets).