Using interleave_datasets with probabilities

pvversteeg · January 26, 2024, 4:46pm

Hi,

I’m trying to use the interleave_datasets function with probabilities. In my use case, some probabilities might be 0. For example, with probabilities=[0.5, 0.5, 0] I would like to sample from the first and second datasets with 50% likelihood each, and never from the third.

Reading the documentation I figured this would be supported. But when trying it, the code seems to loop forever. Does anyone have any experience with this, or some more knowledge of the inner workings that could clear this up?

Thanks in advance!

pvversteeg · January 27, 2024, 6:13am

Ok, I think I figured out why it was infinitely looping. I’m posting this in case it helps someone in the future, but just know it’s not more than an educated guess.

I was using stopping_strategy='all_exhausted', so it keeps sampling until every dataset has been fully sampled. This causes a problem when some datasets have probability 0, because they are never sampled from and so never count as exhausted. I think this caused the infinite loop.
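To illustrate the guess, here is a toy model in plain Python. It is not the library's actual code: the function name, the `max_steps` cap, and the wrap-around behaviour are assumptions made only for the demonstration. It shows that a source with weight 0 is never drawn, so an "all sources exhausted" condition is never met.

```python
import random


def simulate_all_exhausted(sizes, probabilities, max_steps=10_000, seed=0):
    """Toy model of an 'all_exhausted' style loop (hypothetical, not library code).

    Returns (finished, steps), where finished says whether every source
    was fully seen at least once before max_steps was reached.
    """
    rng = random.Random(seed)
    positions = [0] * len(sizes)
    exhausted = [False] * len(sizes)
    for step in range(1, max_steps + 1):
        # Pick a source according to the given probabilities.
        i = rng.choices(range(len(sizes)), weights=probabilities)[0]
        positions[i] += 1
        if positions[i] >= sizes[i]:
            exhausted[i] = True
            positions[i] = 0  # start over, i.e. oversample this source
        if all(exhausted):
            return True, step
    # The cap stands in for "loops forever".
    return False, max_steps


# The third source has weight 0, so it is never exhausted.
stuck = simulate_all_exhausted([3, 3, 3], [0.5, 0.5, 0.0])
# Without the zero-weight source the loop ends.
done = simulate_all_exhausted([3, 3], [0.5, 0.5])
print(stuck, done)
```

In this sketch the first call only stops because of the step cap, while the second call finishes on its own.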

I ended up solving it by doing a little preprocessing: delete all 0s from probabilities as well as their corresponding datasets from the datasets list.
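A minimal sketch of that preprocessing step, in case it is useful. The helper name `drop_zero_probability` is made up for this example, and renormalising the remaining probabilities is an extra precaution I added, not something the workaround strictly needs when the non-zero values already sum to 1.

```python
def drop_zero_probability(datasets, probabilities):
    """Remove every dataset whose sampling probability is 0.

    Hypothetical helper: returns (kept_datasets, kept_probabilities),
    with the remaining probabilities rescaled to sum to 1.
    """
    if len(datasets) != len(probabilities):
        raise ValueError("datasets and probabilities must have the same length")
    kept = [(d, p) for d, p in zip(datasets, probabilities) if p > 0]
    if not kept:
        raise ValueError("at least one probability must be greater than 0")
    total = sum(p for _, p in kept)
    return [d for d, _ in kept], [p / total for _, p in kept]


# Strings stand in for real Dataset objects here.
kept, probs = drop_zero_probability(["ds_a", "ds_b", "ds_c"], [0.5, 0.5, 0])
print(kept, probs)
```

The filtered lists can then be passed on as usual, for example interleave_datasets(kept, probabilities=probs, stopping_strategy='all_exhausted').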

Hope this helps someone in the future. If someone has a better understanding of the interleave_datasets function and thinks this is wrong, please let me know.
