I have created a DatasetDict with 10 splits and am attempting to push it to the Hugging Face Hub, but I keep getting an HFHubHTTPError: 413 Client Error: Payload Too Large for url: error.
In total, the dataset is ~50GB. On my initial attempt it made commits 0 and 1 of 3 (Upload dataset (part 00000-of-00003)) before failing. On my second attempt I reduced the max shard size to 250MB (which made the uploaded files ~150MB each), and it got to commit 3 of 5 before failing.
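In code, the second attempt was roughly:

# smaller shards, same repo as below
dataset.push_to_hub("dconnell/pubtator3_abstracts", max_shard_size="250MB")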
To save progress, I tried using a for loop to upload splits one at a time:
dataset = datasets.load_from_disk(save_path)
for split, ds in dataset.items():
    print(split)
    ds.push_to_hub("dconnell/pubtator3_abstracts", split=split)
But now it repeatedly fails on the first split with the same payload-too-large error. If I skip the first split, it fails on the second split in the same way.
The shard sizes are only a few hundred MB at most, and the locally saved version of the dataset doesn't have any files that are abnormally large compared to the rest. Any idea why I am getting a payload-too-large error?
Should I upload with the huggingface_hub.upload_folder API instead, looping over each split directory of my locally saved dataset? My main concern with that is that the currently uploaded files are organized differently (I think because of the conversion to Parquet), and I want to make sure the dataset is uploaded in the preferred layout.
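For concreteness, this is the kind of thing I have in mind (untested sketch; it assumes save_to_disk wrote one directory per split under save_path):

from huggingface_hub import HfApi

api = HfApi()
for split in dataset:  # the same DatasetDict loaded above
    api.upload_folder(
        folder_path=f"{save_path}/{split}",  # split directory written by save_to_disk
        path_in_repo=split,
        repo_id="dconnell/pubtator3_abstracts",
        repo_type="dataset",
    )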
I did see that issue, but none of my files are anywhere near that 50GB limit, and since the upload is split across multiple commits, each commit is well under 50GB. I'm also confused about how it managed to upload 25+GB before having an issue, or why I'm getting the error when uploading each ~5GB split individually.
I can go ahead and use the huggingface_hub functions directly; I just figured there's a reason the dataset classes have their own push_to_hub method and that I should prefer it over the more generic functions (especially since I can see it converting and reorganizing the dataset before pushing).
I pushed with upload_large_folder, but now I can't get the splits recognized. The "Structure your repository" docs say you can add the splits to the README's YAML section. After doing that, the Hub picked up the splits, but load_dataset keeps telling me train is the only split:
ValueError: Unknown split "BioCXML_9". Should be one of ['train'].
This is even after clearing my local huggingface cache.
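For reference, the YAML I added to the README was along these lines (split name and path are illustrative, not my exact entries):

---
configs:
- config_name: default
  data_files:
  - split: BioCXML_9
    path: data/BioCXML_9/*
  # ... one entry like this per split
---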
I tried reorganizing the file hierarchy as detailed in the "Automatic splits detection" section, but neither the Hub nor load_dataset recognizes the splits in that style either.
Relatedly, it seems unintuitive that save_to_disk tracks splits differently than the Hub does: save_to_disk records the splits in a dataset_dict.json file, which upload_large_folder ignores, so to upload properly I have to move the split directories into a data directory. Does this seem like something worth looking into in a GitHub issue?
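This is the bookkeeping I mean; the file is tiny and just lists the split names (output sketched from memory):

import json

# save_path is the directory passed to save_to_disk / load_from_disk above
with open(f"{save_path}/dataset_dict.json") as f:
    print(json.load(f))
# e.g. {"splits": ["BioCXML_0", ..., "BioCXML_9"]}  (split names illustrative)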
I’ve narrowed it down to a large ClassLabel in the features.
If I reduce the dataset down to 100 rows, I’m still getting the payload error.
Several of my columns use a Sequence of ClassLabel for their features; if I switch those to a Sequence of Value("int64"), the push succeeds.
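The switch was essentially this (sketch, with ds/split as in the loop above; "label" stands in for my actual Sequence(ClassLabel) columns):

features = ds.features.copy()
features["label"] = datasets.Sequence(datasets.Value("int64"))  # was Sequence(ClassLabel(...))
ds = ds.cast(features)
ds.push_to_hub("dconnell/pubtator3_abstracts", split=split)  # now succeeds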
I can reproduce the error with the following attempt to simulate my data:
import random

import datasets

random.seed(42)

def random_str(sz):
    return "".join(chr(random.randint(ord("a"), ord("z"))) for _ in range(sz))

data = datasets.DatasetDict(
    {
        str(i): datasets.Dataset.from_dict(
            {
                "label": [list(range(3)) for _ in range(10)],
                "abstract": [random_str(10_000) for _ in range(10)],
            }
        )
        for i in range(3)
    }
)

# Give "label" a Sequence(ClassLabel) feature with a large (30,000-name) vocabulary.
features = data["1"].features.copy()
features["label"] = datasets.Sequence(
    datasets.ClassLabel(names=[random_str(20) for _ in range(30_000)])
)
data = data.map(lambda examples: {}, features=features)

data.push_to_hub("dconnell/pubtator3_test")
30,000 twenty-character strings (~600KB) doesn't seem like it should be enough to cause this issue. Maybe it's the way this is stored in Parquet?
With my above example (I think we responded at the same time, so you may have missed it), the dataset size is nowhere near a GB, so I don't think that is the issue. sys.getsizeof reports the size of the ClassLabel's names list as ~250KB, which shouldn't be too big, but using a ClassLabel instead of a Value for the features causes the error.
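The size check I ran, roughly (using the features object from the reproduction above):

import sys

names = features["label"].feature.names   # the ClassLabel vocabulary
print(sys.getsizeof(names))               # ~250KB, but this only counts the list object, not the strings in it
print(sum(len(name) for name in names))   # ~600KB of actual characters (30,000 * 20)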
I tried upload_large_folder and it does upload the content, but it isn't recognizing my splits even when using the file structure suggested by the docs (see my response above).
Since it is uploading the Arrow tables, I'm thinking this has something to do with how Parquet stores the ClassLabels.
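If anyone wants to poke at that theory, this is the kind of check I mean (assuming to_parquet embeds the same schema metadata that push_to_hub writes; I haven't verified that):

import pyarrow.parquet as pq

# Write one split from the reproduction above and inspect the Parquet schema
# metadata; the ClassLabel names should show up as a large "huggingface" entry.
data["1"].to_parquet("split_1.parquet")
metadata = pq.read_schema("split_1.parquet").metadata or {}
print({key.decode(): len(value) for key, value in metadata.items()})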