Convert_to_parquet fails for datasets with multiple configs

Hi, we are in the process of converting our datasets that use data loading scripts to data-only datasets, using the convert_to_parquet command of the datasets-cli.
We noticed that for datasets with multiple configs the script throws a BadRequestError and displays the following error message at the end:
“Bad request:
Invalid reference for a branch: refs/pr/1”

The script loads the default configuration and creates a PR. That works fine.
The .push_to_hub method returns a commit info with pr_revision: ‘refs/pr/1’ and pr_url: ‘DFKI-SLT/science_ie · Convert dataset to Parquet’.
But when the script iterates through the remaining configs, it tries to create a branch named ‘refs/pr/1’, which results in the BadRequestError mentioned above.

Is there a way to fix this? I guess it would also be possible to manually load each config and push them directly onto the main branch. (That has been my workaround for now)
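For anyone who wants to try the same workaround, here is a minimal sketch of what I mean by loading each config and pushing it to main. The helper names (push_configs_to_main, run_workaround) and the repo ID are made up for illustration; I am assuming that get_dataset_config_names and load_dataset accept trust_remote_code in your datasets version and that push_to_hub accepts config_name, so please check that against the version you have installed.

```python
from typing import Callable, Iterable, List


def push_configs_to_main(
    repo_id: str,
    config_names: Iterable[str],
    load_config: Callable[[str, str], object],
) -> List[str]:
    """Load each config and push it directly to the main branch.

    `load_config(repo_id, config_name)` must return an object that exposes
    `push_to_hub(repo_id, config_name=...)`, e.g. a `datasets.DatasetDict`.
    """
    pushed = []
    for name in config_names:
        dataset = load_config(repo_id, name)
        # No revision or create_pr argument is passed, so no PR reference
        # such as "refs/pr/1" is involved in this push.
        dataset.push_to_hub(repo_id, config_name=name)
        pushed.append(name)
    return pushed


def run_workaround(repo_id: str) -> List[str]:
    """Hypothetical entry point; needs `datasets` installed and a write token."""
    # Imported lazily so the helper above can be used without the library.
    from datasets import get_dataset_config_names, load_dataset

    # Assumption: trust_remote_code is accepted by both calls in your version.
    names = get_dataset_config_names(repo_id, trust_remote_code=True)
    return push_configs_to_main(
        repo_id,
        names,
        lambda repo, cfg: load_dataset(repo, cfg, trust_remote_code=True),
    )
```

Calling run_workaround("your-namespace/your-dataset") (a placeholder ID) would push every config to main one after another. Note that, unlike the conversion command, this does not go through a pull request, so there is nothing to review before the data lands on main.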


cc @albertvillanova

Normally, the same “refs/pr/1” revision is reused on purpose, so that all configs are converted in a single pull request.

Maybe you merged the PR while it was still converting the rest of the configs?

Thanks for the reply.

No, I did not merge the PR while it was still converting.
I was able to reproduce the issue on a dataset, where I ran the conversion for the first time.

That issue also persists when I rerun the conversion script on the same dataset after the initial fail.
The script creates a new PR each time after loading and pushing the default configuration.
As soon as it processes the second config and tries to push it to the hub, I’m met with the BadRequestError. I guess it can’t find the revision.

Thanks for reporting and giving all the details, @phucdev.

Indeed, it seems to be a bug. I am investigating it.

Hi again, @phucdev.

Sorry, but I can’t reproduce the issue. The conversion of the multiple configs is done in a single PR as expected. See: DFKI-SLT/science_ie · Convert dataset to Parquet with multiple configs converted.

I guess you are maybe using an old version of the huggingface-hub library. Could you please check its version?

import huggingface_hub
print(huggingface_hub.__version__)

Hi @albertvillanova.
I just checked my huggingface_hub version, but it says “0.23.0”, which is the newest version AFAIK.
The datasets library version is “2.19.1”. I also definitely logged in via the huggingface-hub CLI.

Maybe I don’t have permission to access the “refs/pr/1” revision while you do? After the conversion script loads and pushes the default config, I can see the PR, but there is no “refs/pr/1” branch visible. The url pointing to that branch results in the HTTPError 400/ BadRequestError that I recorded in the screenshot in my previous comment.

I just created a new dataset based on the DFKI-SLT/fabner dataset (the version on the script branch) and tested the conversion command, but I still got the BadRequestError.

Perhaps you could try to reproduce it on this clean new dataset: phucdev/fabner · Datasets at Hugging Face

Thanks again for your valuable feedback. Then, we can exclude the issue coming from an old huggingface-hub.

The “refs/pr/1” is not, strictly speaking, a “branch” but a Git reference. And everybody has access to it: DFKI-SLT/science_ie at refs/pr/2
Anyway, I do not think this is the issue.

I am guessing that it could be caused by having logged in via the huggingface-hub CLI…

Could you please try instead to pass a user access token (with write rights) to the convert_to_parquet command: --token YOUR-TOKEN

Note that again I could use convert_to_parquet on your new dataset, and all the configs were converted in the same PR: phucdev/fabner · Convert dataset to Parquet

I just passed my user access token (with write rights) to the convert_to_parquet command, but the issue persists.
I also tried to log out via huggingface-cli logout and then run the convert_to_parquet command with my access token, but without success.

Regarding the url pointing to “refs/pr/1”:
I see. The way I understood the error traceback, the script tries to create a branch for “refs/pr/1” at “https://huggingface.co/api/datasets/phucdev/fabner/branch/refs%2Fpr%2F1”, which fails with code 400.
That is why I run into:

requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/datasets/phucdev/fabner/branch/refs%2Fpr%2F1
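To make my reading of the traceback concrete, here is a small sketch that rebuilds the failing URL and shows the kind of check that would avoid the request. The function names are hypothetical and this is not code taken from datasets or huggingface_hub; it only assumes that pull request revisions always start with “refs/pr/”.

```python
from urllib.parse import quote

HUB_ENDPOINT = "https://huggingface.co"


def branch_api_url(repo_id: str, branch: str) -> str:
    """Rebuild the branch URL that appears in the traceback."""
    # safe="" makes quote() percent-encode "/" as %2F as well.
    encoded_branch = quote(branch, safe="")
    return f"{HUB_ENDPOINT}/api/datasets/{repo_id}/branch/{encoded_branch}"


def is_pr_ref(revision: str) -> bool:
    """Return True for pull request references such as "refs/pr/1"."""
    return revision.startswith("refs/pr/")


def should_create_branch(revision) -> bool:
    """Hypothetical guard: only ordinary branch names need to be created."""
    return revision is not None and not is_pr_ref(revision)
```

With these helpers, branch_api_url("phucdev/fabner", "refs/pr/1") gives the same URL as in the error above, and should_create_branch("refs/pr/1") is False, i.e. a script using such a guard would push to the existing PR reference instead of asking the Hub to create a branch with that name.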

+1 (Exact same issue here)

+1 - same exact issue.

I have opened an issue in the huggingface-hub repo:

And I have opened a PR in the datasets repo:

I hope this fixes the bug. Please feel free to try the source code with the fix and see if it works now.