So I have a local dataset of plain-text documents, one article per file:
siang@kaid:~/dataset/source-txt$ ls
source-folder-sub_folder-sub_sub_folder-articles-all-id-99643.txt
source-folder-sub_folder-sub_sub_folder-articles-all-id-9977.txt
...
siang@kaid:~/dataset/source-txt$ ls | wc -l
634855
and then when I attempt to load them:
>>> from datasets import load_dataset
>>> ds = load_dataset('/home/siang/dataset/source-txt')
Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 4453/4453 [00:00<00:00, 25370.20it/s]
Using custom data configuration source-txt-93721ea9880f0183
Downloading and preparing dataset text/source-txt to /home/siang/.cache/huggingface/datasets/text/source-txt-93721ea9880f0183/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.89it/s]
Extracting data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.64it/s]
Dataset text downloaded and prepared to /home/siang/.cache/huggingface/datasets/text/source-txt-93721ea9880f0183/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 65.61it/s]
>>> ds
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 126843
    })
})
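As far as I understand, the text builder produces one row per line rather than one row per file, so num_rows may not be directly comparable to my file count at all. I also notice the "Resolving data files" bar only shows 4453/4453, which is nowhere near the 634855 files on disk. This is the sanity check I am planning to run (the counting logic below is mine, just to illustrate what I mean):

import glob

# Count the files on disk and the non-empty lines inside them, since the
# default 'text' loader appears to yield one row per line, not one per file.
# (I am counting non-empty lines; I am not sure whether the loader drops
# blank lines, so treat the line total as approximate.)
txt_files = glob.glob('/home/siang/dataset/source-txt/*.txt')
print(len(txt_files))      # expecting 634855

total_lines = 0
for path in txt_files:
    with open(path, encoding='utf-8') as f:
        total_lines += sum(1 for line in f if line.strip())
print(total_lines)         # how does this compare to num_rows=126843?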
So I am not getting all ~635k documents in the dataset. Previously, I had grouped all the data into CSV/JSON files according to the folder structure, so it looked like this:
siang@kaid:~/dataset/source-json$ ls
source-folder-sub_folder-sub_sub_folder1.json
source-folder-sub_folder-sub_sub_folder2.json
...
siang@kaid:~/dataset/source-json$ ls | wc -l
93
With those JSON files, load_dataset would recognize a much lower num_rows count (ignore the split into train and validation):
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 4230
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 223
    })
})
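For reference, the JSON variant is loaded in essentially the same way, just pointing at the grouped files; this is only a rough sketch (paths shortened, and how the validation split is carved out is omitted here):

from datasets import load_dataset

# Rough sketch of the JSON load; each file holds records with a 'text' field,
# matching the features shown above. How the 93 files are assigned to
# train vs. validation is omitted in this sketch.
ds_json = load_dataset(
    'json',
    data_files={
        'train': '/home/siang/dataset/source-json/source-folder-sub_folder-sub_sub_folder*.json',
    },
)
print(ds_json)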
Is there anything I can do to ensure all documents are loaded properly? We previously tried lumping everything into a single 2 GB text file; that works, but managing one huge file is hard, which is why we broke it up into smaller files.
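One thing I was considering instead of going back to a single big file: passing the file list explicitly and asking the text builder to treat each file as one sample. I have not verified this on my datasets version, so sample_by='document' here is an assumption on my part:

from datasets import load_dataset
from glob import glob

txt_files = sorted(glob('/home/siang/dataset/source-txt/*.txt'))

# Pass every file explicitly, and (if the installed datasets version
# supports it) keep one row per file instead of one row per line.
ds = load_dataset('text', data_files={'train': txt_files}, sample_by='document')
print(ds['train'].num_rows)    # hoping this equals 634855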