Protein/molecule datasets

Hi, can we host protein/molecule datasets on Hugging Face?
I only saw the image, text, and audio categories.
We are looking for a public host for a processed AlphaFold2 training dataset (~2TB).

Hi! These categories exist only to make search easier, so you are more than welcome to host your dataset on the Hub. Also, feel free to open an issue in the datasets repo proposing one or more categories we should add to the Hub to cover datasets like yours.

Thank you!
BTW, the data used to train AlphaFold2 is quite large (~2TB) and consists of millions of files. The data-loading pipeline is also quite complex.
Are there any bottlenecks to hosting it?
Thank you in advance!

Hi again! I think there shouldn’t be any bottlenecks (we host some even larger datasets). Still, you can try to reduce the size by storing the dataset in Parquet: generate the processed dataset with a dataset script (load_dataset("./path/to/script")) and then call push_to_hub on it.
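
For illustration, here is a minimal sketch of that flow. It is not the actual AlphaFold2 pipeline: it assumes a hypothetical layout with one record per `.fasta` file, and it uses `Dataset.from_generator` instead of a loading script. The `iter_examples` and `build_and_push` helpers, the `./processed` path, and the repo id are all placeholders.

```python
from pathlib import Path
from typing import Dict, Iterator


def iter_examples(root: str) -> Iterator[Dict[str, str]]:
    """Yield one record per .fasta file under `root` (hypothetical layout)."""
    for path in sorted(Path(root).rglob("*.fasta")):
        lines = path.read_text().splitlines()
        # Join the sequence lines, skipping '>' header lines.
        # Assumes a single record per file.
        sequence = "".join(
            line.strip() for line in lines if not line.startswith(">")
        )
        yield {"id": path.stem, "sequence": sequence}


def build_and_push(root: str, repo_id: str) -> None:
    """Build a dataset from the generator and upload it as Parquet shards."""
    # Imported lazily; requires `pip install datasets` and a logged-in Hub token.
    from datasets import Dataset

    ds = Dataset.from_generator(iter_examples, gen_kwargs={"root": root})
    # push_to_hub converts the dataset to Parquet shards before uploading.
    ds.push_to_hub(repo_id, max_shard_size="500MB")


# Example call (placeholder path and repo id):
# build_and_push("./processed", "your-username/your-dataset")
```

Packing millions of small files into a much smaller number of Parquet shards should also cut down the file count in the repo. A real pipeline would likely need more columns (MSAs, templates, structure features) than this sketch shows.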
