Hi! I’m new here, please let me know if this is the wrong place to ask.
My actual question is at the very bottom, but the TL;DR is that I’m looking for the simplest way to, in effect, release an altered subset of a pre-existing dataset that unfortunately has a “restrictive” license. I understand that I should not expect legal advice here: I simply want to better understand what my options are, and what has been done before.
Suppose that the Hugging Face Hub already has a dataset `original = datasets.load_dataset(original_dataset_name)`. We wish to distribute on Hugging Face a `new_benchmark` dataset whose generation process somehow involved `original`. However, there are legal/licensing restrictions on the distribution of `original`, and it is not clear how these restrictions would translate to `new_benchmark`. My understanding is that resolving all those details would be prohibitively complicated (please let me know if I’m wrong on this).
As a concrete example, suppose that `original_dataset_name = "imagenet-1k"` and that `new_benchmark` contains a few thousand images that are basically indistinguishable from ImageNet samples to a human’s eye. What I’m trying to avoid is complications (e.g., duplicating the EULA found at imagenet-1k · Datasets at Hugging Face).
The “obvious” solution is to release our code for a function `generate_benchmark` such that `new_benchmark = generate_benchmark(original)`. Users may thus obtain `original` themselves (provided that they agree with the terms under which it is distributed) and run our code on it. `new_benchmark` itself is never “distributed”, so we avoid the licensing complications.
But this candidate solution brings its own share of issues, including:
- Additional burden on end users, deterring them from using our benchmark. In particular, computing `generate_benchmark(original)` may be expensive in terms of time/computing resources.
- Nondeterministic behavior. Since the dataset we’re publishing is a benchmark, we wish for it to be identical for all users. This would require extra care in the handling of pseudo-random number generation and race conditions.
- Maintenance/compatibility nightmare. The previous point is partially addressed by checking hashes (e.g., SHA-256) of `new_benchmark`’s samples, but what if some library we depend on makes a change that causes those hash checks to fail? Sure, we could ship `generate_benchmark` in some Docker container, but this isn’t fully future-proof either, while adding extra burden on the end user (who only wishes to type `new_benchmark = datasets.load_dataset(new_benchmark_name)`).
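For what it’s worth, the hash-checking idea could look something like the sketch below. Everything here is hypothetical (the serialized samples and the published digest list are made up); it just shows how we would detect that a user’s locally regenerated benchmark drifted from ours:

```python
import hashlib

def sample_digest(sample_bytes: bytes) -> str:
    """SHA-256 hex digest of one serialized sample."""
    return hashlib.sha256(sample_bytes).hexdigest()

def verify_benchmark(samples, expected_hashes):
    """Check locally generated samples against the digests we would publish.

    `samples` is an iterable of serialized samples (bytes); `expected_hashes`
    is the list of hex digests shipped alongside the generation code.
    """
    mismatches = [
        i for i, (s, h) in enumerate(zip(samples, expected_hashes))
        if sample_digest(s) != h
    ]
    if mismatches:
        raise RuntimeError(f"hash mismatch at indices {mismatches[:10]}")

# Toy usage with made-up data: passes silently when everything matches.
published = [sample_digest(b"sample-0"), sample_digest(b"sample-1")]
verify_benchmark([b"sample-0", b"sample-1"], published)
```

Of course, this only detects the breakage described above; it doesn’t prevent it.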
The other alternative I’m currently considering is to publish two artifacts: a `get_benchmark` function and a `diff_like` dataset. Neither of these artifacts (nor both together) conveys “significant information” about `original`, but a user with access to `original` can obtain `new_benchmark = get_benchmark(original, diff_like)`. How legally valid this approach is could depend on what we mean by “significant information”.
To make this clearer, let me start with what I believe to be a clear-cut “legally fine” scenario. In this scenario, `new_benchmark` is just a subset of `original` with independent Gaussian noise added. Here each entry of `diff_like` associates an index from a split of `original` to a specific realization of Gaussian noise, and `get_benchmark` simply builds `new_benchmark` by adding the right noise to the right sample of `original`. I believe that there are no legal issues here because `original` is never involved in the creation of `diff_like`.
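Concretely, that noise scenario could be sketched as follows. This is illustrative only: the entry layout of `diff_like` and the toy stand-in for `original` are my own made-up conventions, not anything from an existing library:

```python
import numpy as np

def make_diff_like(indices, shape, seed=0):
    """One independent Gaussian noise realization per chosen index.

    Note that `original` is never touched here, which is the whole point:
    `diff_like` carries zero information about the restricted dataset.
    """
    rng = np.random.default_rng(seed)  # fixed seed keeps the benchmark identical for everyone
    return [{"index": i, "noise": rng.standard_normal(shape)} for i in indices]

def get_benchmark(original, diff_like):
    """Rebuild new_benchmark by adding the stored noise to the matching sample."""
    return [original[entry["index"]] + entry["noise"] for entry in diff_like]

# Toy stand-in for `original`: three 2x2 "images".
original = [np.full((2, 2), float(i)) for i in range(3)]
diff_like = make_diff_like([0, 2], shape=(2, 2), seed=123)
new_benchmark = get_benchmark(original, diff_like)
```

We would publish `make_diff_like`’s output (and `get_benchmark`), while users supply `original` themselves.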
Now let’s consider a slightly different scenario, one for which I’m not personally clear on the legal implications. Instead of adding independent Gaussian noise, we generate adversarial examples by adding a tiny perturbation to samples from `original` (see, e.g., Attacking machine learning with adversarial examples). Using the same strategy as before, we could store these perturbations in `diff_like` where we previously stored Gaussian noise. However, unlike before, these perturbations are not independent of `original`: close inspection could reveal some (glitchy/psychedelic) information about `original`.
But we could do better: whatever we wish to put in `diff_like`, we could encrypt it using the corresponding `original` samples as the encryption key. In that last scenario, I personally believe that this ought to be legally fine. But when the law is involved, what ought to be may differ from what is…
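The encryption idea can be sketched with nothing but the standard library, e.g. by deriving a keystream from the SHA-256 of each `original` sample and XOR-ing it with the perturbation. To be clear, this chained-hash construction is a toy I made up for illustration, not a vetted cryptographic scheme:

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Expand `key` into `n` pseudo-random bytes by chained SHA-256."""
    out, block = b"", key
    while len(out) < n:
        block = hashlib.sha256(block).digest()
        out += block
    return out[:n]

def encrypt_perturbation(sample_bytes: bytes, perturbation: bytes) -> bytes:
    """XOR the perturbation with a keystream derived from the sample itself.

    Without the matching `original` sample the ciphertext should look like
    noise; XOR-ing again with the same keystream recovers the perturbation.
    """
    ks = keystream(hashlib.sha256(sample_bytes).digest(), len(perturbation))
    return bytes(a ^ b for a, b in zip(perturbation, ks))

decrypt_perturbation = encrypt_perturbation  # XOR is its own inverse
```

Here the published `diff_like` would hold the ciphertexts, and `get_benchmark` would decrypt each one with the user’s own copy of the corresponding `original` sample before applying it.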
Are there known precedents and/or established ways to achieve what I describe above? Again, I’m not expecting any legal advice, but I welcome any kind of input.