Within a dataset loading script, is it possible to use custom_download for data that is not found at a simple URL? I am querying datasets from an API in a custom format and reformatting them into JSON files. When I attempt to write out shards of each query, I get the following error: Cannot find the requested files in the cached path at...
Here’s a snippet from my custom_download_func. How do I create a properly structured data repository if the download function automatically writes out to the cache dir?
for i, shard in enumerate(divide_chunks(data_dict, shard_size)):
    with gzip.open(path.joinpath(f"{id}_shard_{i}.jsonl.gz"), 'wt', encoding='UTF-8') as zf:
        ndjson.dump(shard, zf)
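For context, here is a self-contained sketch of the sharding step above. The `divide_chunks` helper and `write_shards` wrapper are hypothetical stand-ins (the names are mine, not from any library), and it uses the standard json module to emit one record per line, which matches the newline-delimited format that ndjson.dump produces:

```python
import gzip
import json
from pathlib import Path

def divide_chunks(records, size):
    """Yield successive fixed-size chunks from a list of records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def write_shards(records, out_dir, dataset_id, shard_size):
    """Write records as gzipped newline-delimited JSON shards; return the shard paths."""
    out_dir = Path(out_dir)
    paths = []
    for i, shard in enumerate(divide_chunks(records, shard_size)):
        shard_path = out_dir / f"{dataset_id}_shard_{i}.jsonl.gz"
        with gzip.open(shard_path, "wt", encoding="UTF-8") as zf:
            for record in shard:
                # one JSON object per line, i.e. the ndjson/jsonl format
                zf.write(json.dumps(record) + "\n")
        paths.append(shard_path)
    return paths
```

Reading a shard back is just gzip.open in text mode plus json.loads per line, so the files stay easy to inspect outside the loading script.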