Couldn't find 'my_dataset' on the Hugging Face Hub

I was following this Hugging Face tutorial on uploading my dataset (a JSON file) to the Hub. The tutorial says:

For text data extensions like .csv, .json, .jsonl, and .txt, we recommend compressing them before uploading to the Hub (to .zip or .gz file extension for example)

So I compressed my JSON file into a .gz file. I uploaded it to the Hub in a public repo following their steps, and under Files and versions I currently have 2 files: .gitattributes and train.gz.

I tried to load my dataset with

from datasets import load_dataset
my_dataset = load_dataset('my_username/my_dataset')

But I’m getting the error

FileNotFoundError: Couldn't find a dataset script at my_local_path or any data file in the same directory. Couldn't find 'my_username/my_dataset' on the Hugging Face Hub either: FileNotFoundError: Unable to find train.gz in dataset repository my_username/my_dataset with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'BLP', 'BMP', 'DIB', 'BUFR', 'CUR', 'PCX', 'DCX', 'DDS', 'PS', 'EPS', 'FIT', 'FITS', 'FLI', 'FLC', 'FTC', 'FTU', 'GBR', 'GIF', 'GRIB', 'H5', 'HDF', 'PNG', 'APNG', 'JP2', 'J2K', 'JPC', 'JPF', 'JPX', 'J2C', 'ICNS', 'ICO', 'IM', 'IIM', 'TIF', 'TIFF', 'JFIF', 'JPE', 'JPG', 'JPEG', 'MPG', 'MPEG', 'MSP', 'PCD', 'PXR', 'PBM', 'PGM', 'PPM', 'PNM', 'PSD', 'BW', 'RGB', 'RGBA', 'SGI', 'RAS', 'TGA', 'ICB', 'VDA', 'VST', 'WEBP', 'WMF', 'EMF', 'XBM', 'XPM', 'aiff', 'au', 'avr', 'caf', 'flac', 'htk', 'svx', 'mat4', 'mat5', 'mpc2k', 'ogg', 'paf', 'pvf', 'raw', 'rf64', 'sd2', 'sds', 'ircam', 'voc', 'w64', 'wav', 'nist', 'wavex', 'wve', 'xi', 'mp3', 'opus', 'AIFF', 'AU', 'AVR', 'CAF', 'FLAC', 'HTK', 'SVX', 'MAT4', 'MAT5', 'MPC2K', 'OGG', 'PAF', 'PVF', 'RAW', 'RF64', 'SD2', 'SDS', 'IRCAM', 'VOC', 'W64', 'WAV', 'NIST', 'WAVEX', 'WVE', 'XI', 'MP3', 'OPUS', 'zip']

Is there anything special I need to do to load it? Or do I need to add other files that they did not mention in order to load it?

Hi! You need to replace “my_username” and “my_dataset” with your username on HF and the name of the dataset repository you created.

Hi, yes, I know. I kept the names generic here, but in my actual code I used my real username and repository name.

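Oh OK, sorry. Then it might simply be a dataset format issue: can you try renaming train.gz to train.json.gz? With only the .gz extension there is no way to tell from the file name that the archive contains JSON.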
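
If the uncompressed file is still around, one way to end up with the suggested name is to compress it again while keeping the original extension. This is a minimal sketch using only the Python standard library; the helper name gzip_keep_extension and the file name train.json are assumptions for illustration, not something from the tutorial.

```python
import gzip
import shutil
from pathlib import Path


def gzip_keep_extension(src):
    """Compress `src` next to itself as `<name>.gz`, e.g. train.json -> train.json.gz."""
    src = Path(src)
    # Append ".gz" to the full file name so the ".json" part is preserved.
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


if __name__ == "__main__":
    # Hypothetical local file; adjust the path to your own data.
    print(gzip_keep_extension("train.json"))
```

The resulting train.json.gz can then be uploaded in place of train.gz, and the same load_dataset('my_username/my_dataset') call should be able to pick it up.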

Seems to work! Thanks!
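
For anyone hitting the same error, the idea behind the fix can be sketched as a small file-name check. This is a simplified illustration of extension-based format detection as suggested by the error message, not the actual code in the datasets library; the function name inner_extension and the two extension sets are assumptions, and the data extensions are only a subset of the list in the error.

```python
from pathlib import Path

# Assumption: a small subset of the supported extensions listed in the error message.
DATA_EXTENSIONS = {".csv", ".tsv", ".json", ".jsonl", ".parquet", ".txt"}
# Assumption: only gzip is treated as a compression suffix in this sketch.
COMPRESSION_EXTENSIONS = {".gz"}


def inner_extension(filename):
    """Return the extension that identifies the data format, or None if there is none."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Ignore a trailing compression suffix and look at what comes before it.
    if suffixes and suffixes[-1] in COMPRESSION_EXTENSIONS:
        suffixes = suffixes[:-1]
    if suffixes and suffixes[-1] in DATA_EXTENSIONS:
        return suffixes[-1]
    return None


if __name__ == "__main__":
    for name in ("train.gz", "train.json.gz", "train.json"):
        print(name, inner_extension(name))
```

Under this check, train.gz carries no format information at all, while train.json.gz still identifies the content as JSON.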