Custom Siamese dataset

Hi,

I am trying to implement a custom Siamese dataset using Hugging Face Datasets to eventually publish on the hub.

I have a list of positive pairs and I generate negative pairs on the fly during training (the number of possible negative pairs is huge and it would be inefficient to store them all). I have not seen how to do that in the docs.

Am I missing something or should I really just use a regular torch.utils.data.Dataset subclass and give up on publishing it?

Thanks a lot for your help!

Hi! I think you can treat the positive pairs and the negative pairs as two separate datasets, and apply whatever sampling strategy you want to each one in your training loop.
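To make that concrete, here is a rough sketch of mixing two pair sources inside a training loop. It uses plain Python lists as stand-ins for the two datasets, and `mix_pairs`, `p_positive`, and `n_samples` are names made up for illustration, not part of any library.

```python
import random
from itertools import cycle

def mix_pairs(positives, negatives, p_positive=0.5, n_samples=8, seed=0):
    """Yield (pair, label) tuples drawn from two separate pair sources.

    Hypothetical helper: label 1 marks a positive pair, label 0 a negative one.
    """
    rng = random.Random(seed)
    # cycle() restarts a source when it is exhausted
    pos_iter, neg_iter = cycle(positives), cycle(negatives)
    for _ in range(n_samples):
        if rng.random() < p_positive:
            yield next(pos_iter), 1
        else:
            yield next(neg_iter), 0

# Toy stand-ins for the two datasets
positives = [("a1", "a2"), ("b1", "b2")]
negatives = [("a1", "b1"), ("a2", "b2")]

samples = list(mix_pairs(positives, negatives))
```

The negative source only needs to be iterable, so it does not have to be a stored dataset.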

Thanks for your response. Since there are far too many negative pairs, storing them is neither efficient nor necessary. Because an element can be part of many pairs, I have one dataset for the elements and a second one for the positive pairs.
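For anyone reading later, this is roughly what that setup looks like. It is only a sketch under a few assumptions: plain lists stand in for the two datasets, positive pairs are stored as index pairs into the element list, and `SiamesePairs` and `neg_per_pos` are hypothetical names.

```python
import random

# Stand-in for the elements dataset
elements = [
    {"id": 0, "text": "cat"},
    {"id": 1, "text": "kitten"},
    {"id": 2, "text": "car"},
    {"id": 3, "text": "automobile"},
]
# Stand-in for the positive-pairs dataset: indices into `elements`
positive_pairs = [(0, 1), (2, 3)]

class SiamesePairs:
    """Iterate over stored positive pairs and sample negatives on the fly."""

    def __init__(self, elements, positive_pairs, neg_per_pos=1, seed=0):
        self.elements = elements
        self.positive_pairs = positive_pairs
        # Order-independent lookup so (i, j) and (j, i) count as the same pair
        self.positive_set = {frozenset(p) for p in positive_pairs}
        self.neg_per_pos = neg_per_pos
        self.rng = random.Random(seed)

    def _sample_negative(self):
        # Rejection sampling: assumes negatives vastly outnumber positives,
        # otherwise this loop could spin for a long time
        while True:
            i, j = self.rng.sample(range(len(self.elements)), 2)
            if frozenset((i, j)) not in self.positive_set:
                return i, j

    def __iter__(self):
        for i, j in self.positive_pairs:
            yield self.elements[i], self.elements[j], 1
            for _ in range(self.neg_per_pos):
                a, b = self._sample_negative()
                yield self.elements[a], self.elements[b], 0

examples = list(SiamesePairs(elements, positive_pairs, neg_per_pos=2))
```

Only the elements and the positive pairs are stored, so those are the two pieces that would need to be published. The negative sampling lives in the training code.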

Ok I see, that makes perfect sense :slight_smile: