I would like clarification on the Common Pile v0.1 dataset, specifically the discrepancy between the size stated in the official documentation and the total size of the data available for download from the Hugging Face Hub. According to the arXiv paper (2506.05209v1) and the website, the full Common Pile v0.1 dataset is approximately 8 TB. However, after downloading all available .json.gz and .jsonl.gz files from the common-pile/* repositories, the total comes to approximately 3.6 TB.
Could you kindly clarify the following:
- Does the 8 TB figure include files that are not exposed via the web UI or HfFileSystem?
- Are there any dataset components stored in separate repositories or requiring additional authorization or private access?
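- Is the discrepancy due to compression, i.e. does the 8 TB figure refer to uncompressed text while the 3.6 TB I measured is the gzip-compressed size on disk?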
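
For reference, the compression question can be checked locally by comparing the on-disk size of the downloaded files with their decompressed size. Below is a minimal sketch using only the Python standard library. It assumes the shards are ordinary gzip files. The names `gzip_sizes` and `total_sizes` are my own and are not part of any Common Pile or Hugging Face tooling.

```python
import gzip
import os
from pathlib import Path


def gzip_sizes(path, chunk_size=1 << 20):
    """Return (compressed_bytes, uncompressed_bytes) for one gzip file."""
    compressed = os.path.getsize(path)
    uncompressed = 0
    with gzip.open(path, "rb") as fh:
        # Stream in chunks so large shards never need to fit in memory.
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            uncompressed += len(chunk)
    return compressed, uncompressed


def total_sizes(root, patterns=("*.json.gz", "*.jsonl.gz")):
    """Sum compressed and uncompressed sizes of matching files under root."""
    total_compressed = 0
    total_uncompressed = 0
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            c, u = gzip_sizes(path)
            total_compressed += c
            total_uncompressed += u
    return total_compressed, total_uncompressed


if __name__ == "__main__":
    import sys

    # Usage: python sizes.py /path/to/downloaded/shards
    c, u = total_sizes(sys.argv[1])
    print(f"compressed:   {c / 1e12:.3f} TB")
    print(f"uncompressed: {u / 1e12:.3f} TB")
    if c:
        print(f"ratio:        {u / c:.2f}x")
```

Running this on a sample of shards and extrapolating the ratio would show whether compression alone could plausibly account for the gap. The reported and measured figures differ by a factor of roughly 8 / 3.6 ≈ 2.2.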