So I have a local dataset of plain-text documents, one article per file:
siang@kaid:~/dataset/source-txt$ ls
source-folder-sub_folder-sub_sub_folder-articles-all-id-99643.txt
source-folder-sub_folder-sub_sub_folder-articles-all-id-9977.txt
...
siang@kaid:~/dataset/source-txt$ ls | wc -l
634855
and then when I attempt to load them:
>>> from datasets import load_dataset
>>> ds = load_dataset('/home/siang/dataset/source-txt')
Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 4453/4453 [00:00<00:00, 25370.20it/s]
Using custom data configuration source-txt-93721ea9880f0183
Downloading and preparing dataset text/source-txt to /home/siang/.cache/huggingface/datasets/text/source-txt-93721ea9880f0183/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.89it/s]
Extracting data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.64it/s]
Dataset text downloaded and prepared to /home/siang/.cache/huggingface/datasets/text/source-txt-93721ea9880f0183/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 65.61it/s]
>>> ds
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 126843
    })
})
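As far as I understand, the text builder produces one row per line rather than one row per file, so num_rows may not be directly comparable to my file count at all. I also notice the "Resolving data files" bar only shows 4453/4453, which is nowhere near the 634855 files on disk. This is the sanity check I am planning to run (the counting logic below is mine, just to illustrate what I mean):

import glob

# Count the files on disk and the non-empty lines inside them, since the
# default 'text' loader appears to yield one row per line, not one per file.
# (I am counting non-empty lines; I am not sure whether the loader drops
# blank lines, so treat the line total as approximate.)
txt_files = glob.glob('/home/siang/dataset/source-txt/*.txt')
print(len(txt_files))      # expecting 634855

total_lines = 0
for path in txt_files:
    with open(path, encoding='utf-8') as f:
        total_lines += sum(1 for line in f if line.strip())
print(total_lines)         # how does this compare to num_rows=126843?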
So I am not getting all ~635k documents in the dataset. Previously, I had grouped all the data into CSV/JSON files according to the folder structure, so it looked like this:
siang@kaid:~/dataset/source-json$ ls
source-folder-sub_folder-sub_sub_folder1.json
source-folder-sub_folder-sub_sub_folder2.json
...
siang@kaid:~/dataset/source-json$ ls | wc -l
93
With those JSON files, load_dataset would recognize a much lower num_rows count (ignore the split into train and validation):
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 4230
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 223
    })
})
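For reference, the JSON variant is loaded in essentially the same way, just pointing at the grouped files; this is only a rough sketch (paths shortened, and how the validation split is carved out is omitted here):

from datasets import load_dataset

# Rough sketch of the JSON load; each file holds records with a 'text' field,
# matching the features shown above. How the 93 files are assigned to
# train vs. validation is omitted in this sketch.
ds_json = load_dataset(
    'json',
    data_files={
        'train': '/home/siang/dataset/source-json/source-folder-sub_folder-sub_sub_folder*.json',
    },
)
print(ds_json)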
Is there anything I can do to ensure all documents are loaded properly? We previously tried lumping everything into a single 2 GB text file; that works, but managing one huge file is hard, which is why we broke it up into smaller files.
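One thing I was considering instead of going back to a single big file: passing the file list explicitly and asking the text builder to treat each file as one sample. I have not verified this on my datasets version, so sample_by='document' here is an assumption on my part:

from datasets import load_dataset
from glob import glob

txt_files = sorted(glob('/home/siang/dataset/source-txt/*.txt'))

# Pass every file explicitly, and (if the installed datasets version
# supports it) keep one row per file instead of one row per line.
ds = load_dataset('text', data_files={'train': txt_files}, sample_by='document')
print(ds['train'].num_rows)    # hoping this equals 634855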