How to tweak a dataset without a loading script?

I have a dataset that does not have a loading script. I’d like to make some small modifications at load time, basically making a call to

How can I create a data script that invokes the auto loader?

I’d like to do something like class MyLoader(AutoLoader) and then override generate_examples.

This requires overriding DatasetBuilder’s _post_process method, which requires creating a loading script. Alternatively, you can explain in the README what modifications need to be applied before using the dataset (e.g., a snippet that loads the dataset and runs map on it).
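For the README route, a minimal sketch of such a snippet might look like the one below. The "text" column, the whitespace-stripping tweak, and the names normalize_text and load_tweaked are hypothetical placeholders for whatever modification the dataset actually needs; only load_dataset and Dataset.map come from the datasets library.

```python
def normalize_text(example):
    """Hypothetical tweak: strip whitespace from an assumed "text" column."""
    example["text"] = example["text"].strip()
    return example


def load_tweaked(repo_id, split="train"):
    """Load a script-less dataset and apply the tweak right after loading."""
    # Imported lazily so normalize_text can be reused without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(repo_id, split=split)
    # map() runs on the already-built dataset and caches its result separately.
    return ds.map(normalize_text)
```

Users would then call load_tweaked with the repository id instead of calling load_dataset directly.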

@mariosasko I tried inheriting from DatasetBuilder, but I got an error about _info being undefined. I guess I was wondering if there’s a way to easily override the _post_process method of the “default builder”.


I’ve been looking at this, and it seems like there really isn’t an easy way.

A lot of the scriptless magic is happening in HubDatasetModuleFactoryWithoutScript and LocalDatasetModuleFactoryWithoutScript. But because this happens outside of the builder, it’s difficult to incorporate into a loading script. As an example, let’s say I want to enumerate data files that I have in my dataset repository. I could use glob for this, which would work locally. Or I could use a function to list the files in the repo, which would work on the Hub. Needing two different code paths is pretty awkward, and there’s a big gap between the fully automatic path with no loading script and making even a small tweak to it.
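To make the awkwardness concrete, here is a rough sketch of the two code paths. The helper names list_local_files and list_hub_files and the "*.parquet" default are my own assumptions; list_repo_files is the huggingface_hub function for listing a repository’s files.

```python
import fnmatch
import glob
import os


def list_local_files(root, pattern="*.parquet"):
    """Return paths relative to `root` for matching files, searched recursively."""
    matches = glob.glob(os.path.join(root, "**", pattern), recursive=True)
    # Normalize to forward slashes so results look like repo-relative paths.
    return sorted(os.path.relpath(m, root).replace(os.sep, "/") for m in matches)


def list_hub_files(repo_id, pattern="*.parquet"):
    """Return repo-relative paths in a Hub dataset repo whose file name matches."""
    # Imported lazily so the local helper works without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    names = list_repo_files(repo_id, repo_type="dataset")
    # Match on the file name only, mirroring the local helper.
    return sorted(n for n in names if fnmatch.fnmatch(n.rsplit("/", 1)[-1], pattern))
```

A loading script would have to pick between the two depending on where it is running, which is exactly the part the script-less loader hides.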

An exacerbating factor is that most datasets with a loading script don’t host the data on Hugging Face. It took me a while to find a simple example of a dataset that had both: mteb/sts17-crosslingual-sts. Linking to this or a similar simple script somewhere in the documentation would be helpful.

Copying from the Parquet builder I wound up with this:


import datasets

import pyarrow as pa
import pyarrow.parquet as pq

# Paths are relative to the dataset repository; list any further shards here.
_DATA_FILES = ['data/combined-00009-of-00013-97a88bccf4215954.parquet']


class OOMethodTestDataset(datasets.ArrowBasedBuilder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def _info(self):
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        files = _DATA_FILES
        # Assumed call: fetch each listed file through the download manager.
        downloaded_files = dl_manager.download(files)

        # Assumed split: everything goes into a single train split.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "files": downloaded_files,
                },
            ),
        ]

    def _generate_tables(self, files):
        for file_idx, file in enumerate(files):
            with open(file, "rb") as f:
                parquet_file = pq.ParquetFile(f)
                try:
                    # Assumed arguments: iter_batches() with its default batch size.
                    for batch_idx, record_batch in enumerate(parquet_file.iter_batches()):
                        pa_table = pa.Table.from_batches([record_batch])
                        # Uncomment for debugging (will print the Arrow table size and elements)
                        # logger.warning(f"pa_table: {pa_table} num rows: {pa_table.num_rows}")
                        # logger.warning('\n'.join(str(pa_table.slice(i, 1).to_pydict()) for i in range(pa_table.num_rows)))
                        yield f"{file_idx}_{batch_idx}", pa_table
                except ValueError as e:
                    #logger.error(f"Failed to read file '{file}' with error {type(e)}: {e}")
                    raise

I tried to inherit from the Parquet builder, but it was like swimming upstream.

It really doesn’t seem like it should be this hard to go from a non-loader script to loader script…

Another solution is to make a new dataset without a loading script, and load that dataset from a dataset with a loading script. For example:

map/filter require a dataset to be constructed, so I wouldn’t call using them inside a dataset script (before building the dataset) “tweaking”. We can consider making it easier to specify (and share) post-processing methods if we get more requests like this.

I could use glob for this, which would work locally. Or I could use a function to list the files in the repo, which would work on the hub.

Globbing will become consistent once we start using huggingface_hub.HfFileSystem in datasets (very soon).

I tried to inherit from the Parquet builder, but it was like swimming upstream.

It really doesn’t seem like it should be this hard to go from a non-loader script to loader script…

Making this simple is the goal of

Another solution is to make a new dataset without a loading script, and load that dataset from a dataset with a loading script.

The only issue is that this caches the dataset twice (should be OK for smaller datasets).