Cannot access RedPajama-Data-1T-Sample sub-file

I used the following code to access arxiv_sample.jsonl from the 1B-sized RedPajama-Data-1T-Sample, but got a FileNotFoundError. However, when I click the link in the error, I can download the .jsonl file manually. Any clue why this happens? How can I load the file from code?

from datasets import load_dataset
dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", data_files="arxiv_sample.jsonl")

FileNotFoundError: Unable to find 'https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample/resolve/main/arxiv_sample.jsonl'

Hi, if you only want to access arxiv_sample, you’ll need to download it manually, and then you can load it like this:

from datasets import load_dataset
ds = load_dataset("json", data_files="arxiv_sample.jsonl")
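
If you'd rather not download the file by hand, the download step can probably be scripted too. This is only a sketch: it assumes the huggingface_hub package is installed next to datasets, and load_sub_file is a made-up helper name, not part of either library. It fetches the single file into the local cache and then loads it with the generic "json" builder, so the repo's loading script is never involved.

```python
def load_sub_file(
    repo_id="togethercomputer/RedPajama-Data-1T-Sample",
    filename="arxiv_sample.jsonl",
):
    """Hypothetical helper: download one file from a dataset repo and load it as JSON lines."""
    # Imported inside the function so that defining the helper does not
    # require either package to be installed.
    from datasets import load_dataset
    from huggingface_hub import hf_hub_download

    # Download only the requested file into the local Hugging Face cache
    # and get back its path on disk.
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
    )

    # Use the generic JSON builder on the local file instead of the
    # dataset repo's own loading script.
    return load_dataset("json", data_files=local_path)
```

Calling ds = load_sub_file() should then give the same result as the manual approach above, with the data under ds["train"].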

The error message is indeed incorrect.

The actual reason is that the dataset has a loading script (RedPajama-Data-1T-Sample.py, in togethercomputer/RedPajama-Data-1T-Sample at main) that doesn’t check the data_files config argument.