Parquet file download error: AccessDenied

Hi,

I cannot download some of the parquet files in my dataset. The request fails with the following error:

<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>HDCM44CNRZ6JTXR8</RequestId>
<HostId>hhwLQ0ETCYBSHFG4QvuVKCpmpDXzGWGAnevEXv+vmG5d6ArJr26TOnxiT8rw6iELoCI/1WOcK/o=</HostId>
</Error>

I also tried deleting one of the broken parquet files and re-uploading it with push_to_hub, but the error persists.

Dataset Link:

Broken parquet samples:
‘data/train-00027*’
‘data/train-00470*’
‘data/train-00479*’

How could I solve this problem?
@mariosasko

Thanks,
Rosario

I can also reproduce the error using the Hub UI.

cc @pierric @Wauplin (this op was executed on the repo for the context)

Thanks for the ping @mariosasko.

@elsaEU We’ve investigated the issues on your repo and have an idea of what happened. Is it possible that you ran push_to_hub concurrently, so that some of the same files were uploaded at the same time? (This is only a supposition at this stage.)
In any case, I’ve listed the failing files and there are 103 of them. They have been correctly removed from our database, so you should now be able to re-upload them. Could you give it a try and report back, please? Thanks in advance!

Here is the list of failing files, in case it helps you re-upload them:

data/train-00027-of-05674-b76e1f9d01449dac.parquet
data/train-00470-of-05674-1c19447990ee9d0d.parquet
data/train-00482-of-05674-8de43527f6d0946b.parquet
data/train-00479-of-05674-b305a087f021a1b6.parquet
data/train-00581-of-05674-056a23b1e63613e5.parquet
data/train-00584-of-05674-3a80756271c09512.parquet
data/train-00614-of-05674-fb18e9d24025dff0.parquet
data/train-00652-of-05674-4a599908b19b68fd.parquet
data/train-00701-of-05674-70330cc2e4d94c15.parquet
data/train-00713-of-05674-4e280e17d993cd6f.parquet
data/train-00721-of-05674-352c6d0a23ccfb1b.parquet
data/train-00724-of-05674-ebcde608da6e60e5.parquet
data/train-00783-of-05674-6550a39b6a58c34a.parquet
data/train-00799-of-05674-efcff521ea517cd2.parquet
data/train-00817-of-05674-a663d7e24e5c4d36.parquet
data/train-00822-of-05674-c8ecb05d922718c2.parquet
data/train-00829-of-05674-d66f929e0c7c83bf.parquet
data/train-00835-of-05674-e94730a0c9773b5f.parquet
data/train-00826-of-05674-a26ab63f20e1fbc1.parquet
data/train-00873-of-05674-938e6d186e18ee53.parquet
data/train-00879-of-05674-2866a4f6ecf7eca3.parquet
data/train-00877-of-05674-833121b4743f99b0.parquet
data/train-00899-of-05674-64b9c798a42778f4.parquet
data/train-00935-of-05674-4a97f56caad5063b.parquet
data/train-00936-of-05674-5e8f7dd44adf7ac5.parquet
data/train-00937-of-05674-e55c79713875eab3.parquet
data/train-01000-of-05674-cb60bc6b2717fc07.parquet
data/train-01002-of-05674-835e751bb8c531ee.parquet
data/train-01015-of-05674-4f80d03ea7162107.parquet
data/train-01117-of-05674-7c2f6a551ae8d642.parquet
data/train-01288-of-05674-e2799d0ba71251ad.parquet
data/train-01391-of-05674-a7e49d5d8ffaefdc.parquet
data/train-01399-of-05674-d7196b9b293675b3.parquet
data/train-01423-of-05674-c9e5669b50aec517.parquet
data/train-01417-of-05674-b462fe996aad0ff7.parquet
data/train-01506-of-05674-eaad06aa376aec5e.parquet
data/train-01513-of-05674-f1bb10f7a0039393.parquet
data/train-01633-of-05674-4eab7efae9d45c7d.parquet
data/train-01764-of-05674-29a62903f4e2984d.parquet
data/train-01771-of-05674-0efc849c9c3f909f.parquet
data/train-01786-of-05674-a750df15ffd6f278.parquet
data/train-01799-of-05674-0b0c876392c50881.parquet
data/train-01883-of-05674-9014910f17a146eb.parquet
data/train-01942-of-05674-0e45f0340c5ee53f.parquet
data/train-01953-of-05674-3ae0c3061d500eaa.parquet
data/train-02114-of-05674-0342d0c974d28e89.parquet
data/train-02702-of-05674-afc05304e442c869.parquet
data/train-02703-of-05674-d9c9a0c855888af1.parquet
data/train-02714-of-05674-fdbece27820935f6.parquet
data/train-02754-of-05674-d182d7b7187d461e.parquet
data/train-02877-of-05674-28ede664099d6a51.parquet
data/train-02880-of-05674-ca20b70117e1c030.parquet
data/train-02868-of-05674-a17c4bed39dedde5.parquet
data/train-02931-of-05674-b450b1406f93347c.parquet
data/train-03073-of-05674-78f787cbd4a7e05d.parquet
data/train-03145-of-05674-3372eb71c93a9ed5.parquet
data/train-03197-of-05674-356009d0e5bec49f.parquet
data/train-03276-of-05674-11911a55e79d8eba.parquet
data/train-03301-of-05674-aa1067a1655c1393.parquet
data/train-03353-of-05674-a0ddaa03c6a57300.parquet
data/train-03365-of-05674-d2cdcfda9bb421ee.parquet
data/train-03526-of-05674-81511bf610ae761b.parquet
data/train-03529-of-05674-17b7c78a853fbfb1.parquet
data/train-03535-of-05674-48ef83fb32266bb7.parquet
data/train-03547-of-05674-5d98086469f9a1cf.parquet
data/train-03559-of-05674-ae5272a759251c98.parquet
data/train-03567-of-05674-46dee4cc919ea8e9.parquet
data/train-03586-of-05674-0f369c3b17a15692.parquet
data/train-03598-of-05674-61841498e0e0c26c.parquet
data/train-03610-of-05674-1f9fed21c940dffd.parquet
data/train-03724-of-05674-2955170586ca6761.parquet
data/train-03729-of-05674-1cbf18ca4a25eb77.parquet
data/train-03730-of-05674-f15ab059d5dee4b7.parquet
data/train-03743-of-05674-b6b4379b60b575e4.parquet
data/train-03763-of-05674-2f7c181ef7419496.parquet
data/train-03766-of-05674-94ffc536591cc1bf.parquet
data/train-03799-of-05674-bf5e9c4352d71f29.parquet
data/train-03807-of-05674-09ff78d85feaf199.parquet
data/train-03826-of-05674-3d6f05733c613a8a.parquet
data/train-03838-of-05674-12e5c4b61fc83d6e.parquet
data/train-03854-of-05674-6643cb809183375f.parquet
data/train-03888-of-05674-1c02c84c472b4709.parquet
data/train-04201-of-05674-4e7003cb2ac9e5b3.parquet
data/train-04223-of-05674-0ad7c879a059cc34.parquet
data/train-04227-of-05674-b7d52aa2bd408c89.parquet
data/train-04248-of-05674-c0f94af70f31de40.parquet
data/train-04305-of-05674-7197085fd565249c.parquet
data/train-04322-of-05674-cc69deaa7b20283d.parquet
data/train-04583-of-05674-8f0491d4600c5cba.parquet
data/train-04659-of-05674-2a709b728c06c820.parquet
data/train-04662-of-05674-eecd2ef84a6cb8d5.parquet
data/train-04735-of-05674-5137caad5aab7546.parquet
data/train-04842-of-05674-fa092087792e291e.parquet
data/train-05240-of-05674-b72e31eb44212b0c.parquet
data/train-05377-of-05674-ba9a9e32a88830c7.parquet
data/train-05374-of-05674-ef32b5f976b8f0cc.parquet
data/train-05372-of-05674-4aa68ee310c46eff.parquet
data/train-05431-of-05674-23c296489d85e2d1.parquet
data/train-05471-of-05674-8c4f7f35e25f9660.parquet
data/train-05477-of-05674-cf2aa56c6eed2d0a.parquet
data/train-05498-of-05674-529501c370887288.parquet
data/train-05562-of-05674-8fe378ec7679297f.parquet
data/train-05673-of-05674-d8d2037e9e86fbb0.parquet

Thank you @Wauplin. I was able to manually delete the train-00027, train-00470, and train-00482 parquet files using the Hub UI and re-upload them correctly.

Is there a way to make a single commit that removes all the broken parquet files in the list through the HF API, without downloading the whole repo or using the UI?

Is it possible that you’ve concurrently tried to run push_to_hub, leading to concurrent uploads of some same files? (only a supposition at the stage).

During the first upload, the process was interrupted very frequently, possibly by connection errors caused by my proxy. Each time, push_to_hub would restart and re-check all the parquet files already uploaded. To avoid that wait after every disconnection, I added a retry around the create_commit method in hf_api.py. I think this may have caused the concurrency problem (or other trouble, sorry for this!).
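
For anyone else tempted to add a retry like this, here is a minimal sketch of a strictly sequential one: each attempt has to return or raise before the next one starts, and the wait between attempts doubles each time. The helper name retry_sequentially, its parameters, and the choice to retry on OSError are all assumptions made for illustration. Nothing here is part of huggingface_hub, and it is not established that this would have avoided the problem in this thread.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_sequentially(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),  # assumption: adjust to your client
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds, one attempt at a time.

    An attempt must finish (return or raise) before the next one starts,
    so two attempts are never in flight at the same moment.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ... the base delay
            print(f"attempt {attempt} failed ({err!r}); retrying in {delay:.0f}s")
            sleep(delay)
    raise AssertionError("unreachable")
```

It would be used by wrapping the call in a zero-argument function, for example retry_sequentially(lambda: create_commit(...)). The sleep parameter is only there so the waiting can be replaced in tests.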

You can delete the broken files programmatically using huggingface_hub:

from huggingface_hub import CommitOperationDelete, create_commit

files_to_delete = [...]  # list of repo paths to delete
operations = [CommitOperationDelete(path_in_repo=file) for file in files_to_delete]
# repo_type="dataset" is needed for dataset repos (the default is a model repo)
create_commit(repo_id="<repo_id>", repo_type="dataset", operations=operations,
              commit_message="<commit_message>")
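
If you only have shard prefixes such as data/train-00027* rather than the full file names, the list can be built from the repo listing first. The following is a rough sketch under that assumption: match_broken and delete_matching are hypothetical helper names, while list_repo_files, create_commit and CommitOperationDelete come from huggingface_hub. Listing the files only fetches their names, so nothing is downloaded.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def match_broken(all_files: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the repo paths that match at least one glob pattern, sorted."""
    patterns = list(patterns)
    return sorted(
        path for path in all_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    )


def delete_matching(repo_id: str, patterns: Iterable[str]) -> List[str]:
    """Delete every matching file from a dataset repo in a single commit."""
    # Imported here so match_broken can be tried without huggingface_hub installed.
    from huggingface_hub import CommitOperationDelete, create_commit, list_repo_files

    targets = match_broken(list_repo_files(repo_id, repo_type="dataset"), patterns)
    if targets:  # skip the commit entirely when nothing matches
        create_commit(
            repo_id=repo_id,
            repo_type="dataset",
            operations=[CommitOperationDelete(path_in_repo=path) for path in targets],
            commit_message=f"Delete {len(targets)} broken parquet files",
        )
    return targets
```

Running match_broken on its own first and checking the returned paths is a cheap way to make sure a pattern does not catch more files than intended before anything is deleted.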

Thank you guys, the dataset has been fixed.