Unable to load all raw text files from dataset

So I have a local dataset containing plain-text documents, one article per file:

siang@kaid:~/dataset/source-txt$ ls
source-folder-sub_folder-sub_sub_folder-articles-all-id-99643.txt
source-folder-sub_folder-sub_sub_folder-articles-all-id-9977.txt
...
siang@kaid:~/dataset/source-txt$ ls | wc -l
634855

and then when I attempt to load them:

>>> from datasets import load_dataset                                                                                                                                        
>>> ds = load_dataset('/home/siang/dataset/source-txt')
Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 4453/4453 [00:00<00:00, 25370.20it/s]
Using custom data configuration source-txt-93721ea9880f0183
Downloading and preparing dataset text/source-txt to /home/siang/.cache/huggingface/datasets/text/source-txt-93721ea9880f0183/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.89it/s]
Extracting data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.64it/s]
Dataset text downloaded and prepared to /home/siang/.cache/huggingface/datasets/text/source-txt-93721ea9880f0183/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 65.61it/s]
>>> ds
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 126843
    })
})

So I am not getting all 600k documents in the dataset. Previously, I had grouped all the data into CSV/JSON files according to the folder structure, so the layout looked like this:

siang@kaid:~/dataset/source-json$ ls
source-folder-sub_folder-sub_sub_folder1.json
source-folder-sub_folder-sub_sub_folder2.json
...
siang@kaid:~/dataset/source-json$ ls | wc -l
93

That layout also gives a much lower num_rows count than expected (ignore the train/validation split):

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 4230
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 223
    })
})

Either way, load_dataset recognizes far fewer rows than there are documents. Is there anything I can do to ensure all the documents are loaded properly? We previously tried lumping everything into a single 2 GB text file, which worked, but managing it was hard, so we broke that one big file into smaller ones.
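
One thing I am considering is bypassing the automatic data-file resolution and handing the generic text builder an explicit glob. This is just a sketch of what I mean (the paths are my local ones, and I have not confirmed it picks up every file):

from datasets import load_dataset

# Point the generic "text" builder at an explicit glob instead of letting
# load_dataset resolve the directory layout on its own.
data_files = {"train": "/home/siang/dataset/source-txt/*.txt"}

ds = load_dataset("text", data_files=data_files)
print(ds)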

Anyway, after simplifying the naming of the data files, I got it to work.
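
For anyone hitting the same thing, this is roughly how I sanity-check that every document made it in. Note that sample_by="document" (one row per file instead of one row per line) is only available in reasonably recent datasets releases, so treat this as a sketch rather than a drop-in:

from pathlib import Path
from datasets import load_dataset

data_dir = Path("/home/siang/dataset/source-txt")
n_files = sum(1 for _ in data_dir.glob("*.txt"))

# With sample_by="document" each .txt file becomes exactly one row,
# so rows and files should match if everything was picked up.
ds = load_dataset("text",
                  data_files={"train": str(data_dir / "*.txt")},
                  sample_by="document")

print(f"files on disk: {n_files}")
print(f"rows loaded:   {ds['train'].num_rows}")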