Custom loading dataset script

sl02 · December 25, 2022, 11:24am

I have prepared and tested a custom loading dataset script. Although the test was successful, it hasn’t created the metadata file dataset_info.json in my dataset folder. Instead it has created a README.md file. What am I missing here?

Output from the test

datasets-cli test datasets/tse --save_infos --all_configs
Testing builder 'entity' (1/2)
Downloading and preparing dataset tse/entity to /Users/home/.cache/huggingface/datasets/tse/entity/1.0.0/7f33fefc7622e65f7e037dda1222c13ada1717400eef6ed51970adf336261e21...
Dataset tse downloaded and prepared to /Users/home/.cache/huggingface/datasets/tse/entity/1.0.0/7f33fefc7622e65f7e037dda1222c13ada1717400eef6ed51970adf336261e21. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 321.33it/s]
Dataset card saved at datasets/tse/README.md
Testing builder 'queries' (2/2)
Downloading and preparing dataset tse/queries to /Users/home/.cache/huggingface/datasets/tse/queries/1.0.0/7f33fefc7622e65f7e037dda1222c13ada1717400eef6ed51970adf336261e21...
Dataset tse downloaded and prepared to /Users/home/.cache/huggingface/datasets/tse/queries/1.0.0/7f33fefc7622e65f7e037dda1222c13ada1717400eef6ed51970adf336261e21. Subsequent calls will reuse this data.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25.49it/s]
Dataset card saved at datasets/tse/README.md
Test successful.

Directory listing

(base) home@sls-MacBook-Pro tse % ls -l
total 32
-rw-r--r--  1 home  staff  1311 Dec 25 16:18 README.md
drwxr-xr-x  4 home  staff   128 Dec 25 14:20 data
-rw-r--r--  1 home  staff  9500 Dec 25 16:16 tse.py

For the final test, why do we need to clone the huggingface/datasets repo in order to run the following command?
RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_tse

Could you please explain?

sl02 · December 25, 2022, 1:50pm

It turns out I was looking at an older version of the dataset script instructions.

Just to be certain, we don’t require a dataset_info.json file. Is this correct?

sl02 · December 25, 2022, 2:06pm

Got another query.

How do we pass additional arguments to the _generate_examples() method from the load_dataset() function?

sl02 · December 25, 2022, 3:28pm

I figured out how to pass additional arguments to the function. Thanks!

If you could answer my earlier question, I would appreciate it.

lhoestq · January 3, 2023, 11:05am

Hi! We now store the dataset infos in the YAML tags of README.md instead of in a JSON file, so that everything is in one place. This means a separate dataset_info.json file is no longer needed.
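
To check that the infos were written, you can look at the YAML block at the top of the generated README.md. The snippet below is a minimal sketch using only the standard library: it pulls out the text between the leading `---` markers and checks for a `dataset_info` key. The helper name read_front_matter and the sample README content are illustrative assumptions, not output copied from the command above.

```python
def read_front_matter(readme_text: str) -> str:
    """Return the YAML header between the leading '---' markers, or ''."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    return ""  # opening marker without a closing one


# Illustrative README content; the real file will hold more fields.
SAMPLE_README = """---
dataset_info:
- config_name: entity
  features:
  - name: text
    dtype: string
---
# Dataset card
"""

front_matter = read_front_matter(SAMPLE_README)
has_dataset_info = any(
    line.startswith("dataset_info:") for line in front_matter.splitlines()
)
print(has_dataset_info)
```

To inspect the real file, read datasets/tse/README.md into a string and pass it to the same helper.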
