Using interleave_datasets with probabilities

Hi,

I’m trying to use the interleave_datasets function with probabilities. In my use case, some probabilities might be 0. For example, probabilities=[0.5, 0.5, 0], where I would like to sample from the first and second datasets with 50% likelihood each, and never from the third.

Reading the documentation, I figured this would be supported. But when I try it, the code seems to loop forever. Does anyone have any experience with this, or some more knowledge of the inner workings that could clear this up?

Thanks in advance!

Ok, I think I figured out why it was looping infinitely. I’m posting this in case it helps someone in the future, but just know it’s no more than an educated guess.

I was using stopping_strategy='all_exhausted', so it would keep trying to sample until every dataset had been fully sampled. That is a problem when some datasets have probability 0, because they are never sampled from and so can never be exhausted. I think this is what caused the infinite loop.
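To illustrate the guess above, here is a toy model in plain Python. It is not the library's actual code, just a sketch of the assumed behaviour: pick a dataset index according to the probabilities, and stop only once every dataset has been drawn from at least as many times as it has rows. The function name, the max_steps safety cap, and the sizes are all made up for the illustration.

```python
import random


def steps_until_all_exhausted(sizes, probabilities, max_steps, seed=0):
    """Toy model of an 'all_exhausted'-style loop (hypothetical, not library code).

    Returns the number of draws needed until every dataset has been drawn from
    at least `size` times, or None if that never happens within `max_steps`.
    """
    rng = random.Random(seed)
    draws = [0] * len(sizes)
    for step in range(1, max_steps + 1):
        # Choose which dataset to sample from, weighted by its probability.
        index = rng.choices(range(len(sizes)), weights=probabilities)[0]
        draws[index] += 1
        # Stop only when every dataset has been fully covered.
        if all(count >= size for count, size in zip(draws, sizes)):
            return step
    return None  # gave up: some dataset was never exhausted


# With a zero-probability dataset, the stopping condition is never met.
with_zero = steps_until_all_exhausted([3, 3, 3], [0.5, 0.5, 0], max_steps=10_000)

# Without it, the loop finishes quickly.
without_zero = steps_until_all_exhausted([3, 3], [0.5, 0.5], max_steps=10_000)
```

In this toy version, the first call runs into the step cap because the third dataset is never picked, while the second call terminates. Without a cap, the first case would simply never return.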

I ended up solving it by doing a little preprocessing: delete all 0s from probabilities as well as their corresponding datasets from the datasets list.
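For anyone who wants it, the preprocessing looks roughly like this. It is a minimal sketch: drop_zero_probability is a helper name I made up, and the strings stand in for real Dataset objects so the snippet runs on its own.

```python
def drop_zero_probability(datasets, probabilities):
    """Remove every dataset whose sampling probability is 0.

    Hypothetical helper; returns the filtered datasets and probabilities
    in their original order.
    """
    if len(datasets) != len(probabilities):
        raise ValueError("datasets and probabilities must have the same length")
    kept = [(d, p) for d, p in zip(datasets, probabilities) if p > 0]
    if not kept:
        raise ValueError("at least one probability must be greater than 0")
    kept_datasets = [d for d, _ in kept]
    kept_probabilities = [p for _, p in kept]
    return kept_datasets, kept_probabilities


# Placeholder names standing in for real datasets.
all_datasets = ["ds_a", "ds_b", "ds_c"]
all_probabilities = [0.5, 0.5, 0]

kept_datasets, kept_probabilities = drop_zero_probability(all_datasets, all_probabilities)

# With real Dataset objects, the filtered lists would then be passed on:
# from datasets import interleave_datasets
# mixed = interleave_datasets(
#     kept_datasets,
#     probabilities=kept_probabilities,
#     stopping_strategy="all_exhausted",
# )
```

Since the removed entries all had probability 0, the remaining probabilities still sum to the same total, so no renormalization is needed.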

Hope this helps someone in the future. If someone has a better understanding of the interleave_datasets function and thinks this is wrong, please let me know.
