I have a dataset that does not have a loading script. I’d like to make some small modifications at load time, basically making a call to `ds.map`.

How can I create a loading script that invokes the auto loader? I’d like to do something like `class MyLoader(AutoLoader)` and then override `generate_examples`.
This requires overriding `DatasetBuilder`’s `_post_process` method, which requires creating a loading script. Alternatively, you can explain in the README what modifications need to be applied before using the dataset (e.g., a snippet that loads the dataset and runs `map` on it).
@mariosasko I tried inheriting from `DatasetBuilder`, but I got an error about `_info` being undefined. I guess I was wondering if there’s a way to easily override the `_post_process` method of the “default builder”.
@mariosasko
I’ve been looking at this, and it seems like there really isn’t an easy way.
A lot of the scriptless magic happens in `HubDatasetModuleFactoryWithoutScript` and `LocalDatasetModuleFactoryWithoutScript`. But because this happens outside the builder, it’s difficult to incorporate into a loading script. For example, say I want to enumerate the data files in my dataset repository: I could use `glob`, which would work locally, or I could use a function that lists the files in the repo, which would work on the Hub. This is pretty awkward, and there’s a big gap between the fully automatic no-loading-script path and even a slight tweak to it.
An exacerbating factor is that most datasets with a loading script don’t host the data on Hugging Face. It took me a while to find a simple example of a dataset that has both: sts17-crosslingual-sts.py · mteb/sts17-crosslingual-sts at main. Linking to this or a similarly simple script somewhere in the documentation would be helpful.
Copying from the Parquet builder, I wound up with this:

```python
#!/usr/bin/python
import datasets
import pyarrow as pa
import pyarrow.parquet as pq

logger = datasets.utils.logging.get_logger(__name__)

_DATA_FILES = [
    "data/combined-00009-of-00013-97a88bccf4215954.parquet",
    "data/combined-00004-of-00013-119d653561443d7b.parquet",
    "data/combined-00007-of-00013-ab54cce4ee6331d0.parquet",
    "data/combined-00002-of-00013-149f5d0d22fe8f52.parquet",
    "data/combined-00003-of-00013-426af6f6064e67dd.parquet",
    "data/combined-00010-of-00013-89d7565c5f0d2e4e.parquet",
    "data/combined-00000-of-00013-36d239509fb9e430.parquet",
    "data/combined-00005-of-00013-363bba92a2b7f737.parquet",
    "data/combined-00006-of-00013-4d4d574c9d87176e.parquet",
    "data/combined-00001-of-00013-d5b44e96ad7d2927.parquet",
    "data/combined-00012-of-00013-84cf41ef75dd5b76.parquet",
    "data/combined-00011-of-00013-4c21766cedd5a4a0.parquet",
    "data/combined-00008-of-00013-674f74b6f2288c61.parquet",
]


class OOMethodTestDataset(datasets.ArrowBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        downloaded_files = dl_manager.download(_DATA_FILES)
        return [
            datasets.SplitGenerator(
                name="combined",
                gen_kwargs={"files": downloaded_files},
            ),
        ]

    def _generate_tables(self, files):
        for file_idx, file in enumerate(files):
            with open(file, "rb") as f:
                parquet_file = pq.ParquetFile(f)
                try:
                    for batch_idx, record_batch in enumerate(
                        parquet_file.iter_batches(batch_size=10_000)
                    ):
                        pa_table = pa.Table.from_batches([record_batch])
                        yield f"{file_idx}_{batch_idx}", pa_table
                except ValueError as e:
                    logger.error(f"Failed to read file '{file}' with error {type(e)}: {e}")
                    raise
```
I tried to inherit from the Parquet builder, but it was like swimming upstream.
It really doesn’t seem like it should be this hard to go from no loading script to a loading script…
Another solution is to make a new dataset without a loading script, and load that dataset from a dataset with a loading script. For example:
`map`/`filter` require a dataset to be constructed, so I wouldn’t call using them inside a dataset script (before building the dataset) “tweaking”. We can consider making it easier to specify (and share) post-processing methods if we get more requests like this.
I could use glob for this, which would work locally. Or I could use a function to list the files in the repo, which would work on the hub.
Globbing will become consistent once we start using `huggingface_hub.HfFileSystem` in `datasets` (very soon).
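In the meantime, a sketch of listing repo files with `HfFileSystem` (the repo id is hypothetical; note that dataset repos are addressed with a `datasets/` path prefix):

```python
from huggingface_hub import HfFileSystem


def parquet_pattern(repo_id: str) -> str:
    # Hub dataset repos live under the "datasets/" prefix in HfFileSystem paths.
    return f"datasets/{repo_id}/data/*.parquet"


def list_parquet_files(repo_id: str):
    fs = HfFileSystem()  # anonymous access works for public repos
    return fs.glob(parquet_pattern(repo_id))


if __name__ == "__main__":
    print(list_parquet_files("user/my-dataset"))  # hypothetical repo id
```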
I tried to inherit from the Parquet builder, but it was like swimming upstream.
It really doesn’t seem like it should be this hard to go from no loading script to a loading script…
Making this simple is the goal of https://github.com/huggingface/datasets/pull/5331.
Another solution is to make a new dataset without a loading script, and load that dataset from a dataset with a loading script.
The only issue is that this caches the dataset twice (should be OK for smaller datasets).