Problem "Bad request" when using datasets.Dataset.push_to_hub()

First of all, thank you for your reply. I really appreciate it.

So, I want to mention that while I was searching for a solution, I re-ran my code without making any changes, just hoping that this time it would work. And well, it did :slight_smile: I don’t know, it feels kind of random? Anyway, it worked, and I pushed my dataset to Hugging Face successfully.
About your recommendation to check the size of the chunk data/train-00253-of-00432.parquet: I don’t know how to check it, because my dataset is originally saved as .arrow files, and when I call Dataset.push_to_hub(), the library automatically converts the .arrow files to .parquet files and pushes those to Hugging Face. So I have no way to inspect the size (or other information) of the .parquet files directly. I guess there might be a function that creates .parquet files from the .arrow files? If so, I could call that function first, save the .parquet files locally, and then check their size.
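
If it helps, here is a rough sketch of what I mean about converting to Parquet locally first. The path is just a placeholder, I’m assuming the dataset was saved with save_to_disk(), and to_parquet() writes a single file rather than the exact shards push_to_hub() would create, so it only gives a rough idea of the size:

```python
import os
from datasets import load_from_disk

# Placeholder path, assuming the dataset was saved with save_to_disk()
ds = load_from_disk("path/to/my_arrow_dataset")

# Write a local Parquet copy so its size can be inspected
os.makedirs("local_check", exist_ok=True)
ds.to_parquet("local_check/train.parquet")

size_mb = os.path.getsize("local_check/train.parquet") / 1024**2
print(f"Parquet file size: {size_mb:.2f} MB")
```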

About the versions: I’m currently using huggingface_hub version 0.24.7 and datasets version 3.0.0. I believe these are the latest versions, since I updated both just yesterday.

And about the error ModuleNotFoundError: No module named 'apt_pkg', it comes from my python3 directory, and I’m still trying to figure out how to fix it. I’m working on my company’s server and I haven’t gotten used to it yet, so it’s taking me a bit of time to understand the system.

One final thing I want to mention is an assumption I’ve been thinking about. While Dataset.push_to_hub() is executing, I observed that one shard seems to correspond to one .parquet file. If that’s true, then one shard could end up smaller than the minimum allowed size (as the Bad request message says), and that could be what causes the error. If so, the fix would be to control either the size of each shard or the number of shards. Looking at the documentation of Dataset.push_to_hub(), there are two parameters for exactly this: max_shard_size and num_shards. So I think the error might be solved by setting those parameters properly (in my code I didn’t set them, so they were left at their defaults). That’s just my assumption, though; I haven’t tried it yet. If I run into the same error again, I’ll try this solution (see the sketch below).
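
To make that concrete, the call would look something like this (the repo name and path are placeholders, and I haven’t verified that this actually fixes the error):

```python
from datasets import load_from_disk

ds = load_from_disk("path/to/my_arrow_dataset")  # placeholder path

# Option 1: cap every shard at a fixed size
ds.push_to_hub("my-username/my-dataset", max_shard_size="500MB")

# Option 2: fix the number of shards instead (max_shard_size and num_shards
# cannot be used together)
# ds.push_to_hub("my-username/my-dataset", num_shards=64)
```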
If anyone knows the real solution to this error, please comment. And for anyone else who is looking for a solution, you could try re-running the code and praying :grin: or you could try my assumption by setting the two parameters mentioned above.

@not-lain, thanks again for your help. I hope you’ll read my reply!