RFC: Licensing datasets that alter existing datasets

Hi! I’m new here, please let me know if this is the wrong place to ask.

My actual question is at the very bottom, but the TL;DR is that I’m looking for the simplest way to, in effect, release an altered subset of a pre-existing dataset that unfortunately has a “restrictive” license. I understand that I should not expect legal advice here: I simply want to better understand what my options are and what has been done before.

Problem statement

Suppose that the Hugging Face Hub already hosts a dataset original = datasets.load_dataset(original_dataset_name). We wish to distribute on the Hub a new_benchmark dataset whose generation process somehow involved original. However, there are legal/licensing restrictions on the distribution of original, and it is not clear how these restrictions would carry over to new_benchmark. My understanding is that resolving all those details would be prohibitively complicated (please let me know if I’m wrong about this).

As a concrete example, suppose that original_dataset_name = "imagenet-1k" and that new_benchmark contains a few thousand images that are basically indistinguishable from ImageNet samples to the human eye. What I’m trying to avoid are complications such as having to duplicate the EULA found at imagenet-1k · Datasets at Hugging Face.

Legally clear but otherwise inconvenient approach

The “obvious” solution is to release our code for a function generate_benchmark such that new_benchmark = generate_benchmark(original). Users can then obtain original themselves (provided they agree to the terms under which it is distributed) and run our code on it. Since new_benchmark itself is never “distributed”, we avoid the licensing complications.
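For concreteness, here is a minimal sketch of what we would ship under this approach (the subset selection, the per-sample transformation, and the split name are placeholders, not our actual pipeline):

```python
import datasets

def generate_benchmark(original: datasets.Dataset) -> datasets.Dataset:
    """Build the benchmark locally from the user's own copy of `original`.

    The selection and the transformation below are placeholders for whatever
    processing the real benchmark requires.
    """
    subset = original.select(range(1000))       # placeholder: pick the relevant samples
    return subset.map(lambda sample: sample)    # placeholder: (possibly expensive) transformation

# What an end user would have to run, instead of a single load_dataset call:
# original = datasets.load_dataset("imagenet-1k", split="validation")
# new_benchmark = generate_benchmark(original)
```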

But this candidate solution brings its own share of issues, including:

  1. Additional burden on end users, deterring them from using our benchmark. In particular, running generate_benchmark(original) may be expensive in terms of time and computing resources.

  2. Nondeterministic behavior. Since the dataset we’re publishing is a benchmark, we wish for it to be identical for all users. This would require extra care in the handling of pseudo-random number generation and race conditions.

  3. Maintenance/compatibility nightmare. The previous point is partially addressed by checking hashes (e.g., SHA-256) of new_benchmark’s samples, as sketched below, but what if some library we depend on changes in a way that causes those hash checks to fail? Sure, we could ship generate_benchmark in a Docker container, but this isn’t fully future-proof either, and it adds yet another burden on the end user (who only wishes to type new_benchmark = datasets.load_dataset(new_benchmark_name)).
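Regarding the hash checking mentioned in point 3, a minimal sketch of what I have in mind (the per-sample serialization is a placeholder; real image data would need a canonical byte encoding for the digests to be stable across libraries):

```python
import hashlib

def sample_digest(sample: dict) -> str:
    # Placeholder serialization; real images would need a canonical byte
    # encoding (e.g., raw pixel buffers) rather than repr().
    payload = repr(sorted(sample.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def verify_benchmark(new_benchmark, reference_digests) -> bool:
    """Check a locally regenerated benchmark against published reference hashes."""
    return all(
        sample_digest(sample) == expected
        for sample, expected in zip(new_benchmark, reference_digests)
    )
```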

Legally unclear but otherwise preferable alternative

The other alternative I’m currently considering is to publish two artifacts: a get_benchmark function and a diff_like dataset. Neither of these artifacts (nor the two together) conveys “significant information” about original, but a user with access to original can obtain new_benchmark = get_benchmark(original, diff_like). How legally valid this approach is could depend on what we mean by “significant information”.

To make this clearer, let me start with what I believe to be a clear-cut “legally fine” scenario. In this scenario, new_benchmark is just a subset of original with independent Gaussian noise added. Each entry of diff_like associates an index from a split of original with a specific realization of Gaussian noise, and get_benchmark simply builds new_benchmark by adding the right noise to the right sample of original. I believe there are no legal issues here because original is never involved in the creation of either diff_like or get_benchmark.
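A minimal sketch of this Gaussian-noise scenario (the column name "image", the image shape, and selected_indices are made-up placeholders):

```python
import numpy as np
import datasets

rng = np.random.default_rng(seed=0)
selected_indices = range(1000)  # placeholder: which samples of `original` we use

# diff_like is generated without ever touching `original`: it only records which
# index to use and which independently drawn noise realization to add to it.
diff_like = [
    {"original_index": i, "noise": rng.normal(0.0, 0.1, size=(224, 224, 3))}
    for i in selected_indices
]

def get_benchmark(original: datasets.Dataset, diff_like) -> list:
    """Rebuild new_benchmark on the user's machine from their copy of `original`."""
    return [
        {"image": np.asarray(original[entry["original_index"]]["image"]) + entry["noise"]}
        for entry in diff_like
    ]
```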

Now let’s consider a slightly different scenario, one for which I’m not personally clear on the legal implications. Instead of adding independent Gaussian noise, we generate adversarial examples by adding a tiny perturbation to samples from original (see, e.g., Attacking machine learning with adversarial examples). Using the same strategy as before, we could store these perturbations in diff_like where we previously stored Gaussian noise. However, unlike before, these perturbations are not independent of original: close inspection could reveal some (glitchy/psychedelic) information about original.
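For illustration, the perturbations could come from a one-step attack such as FGSM; here is a rough sketch assuming a PyTorch classifier (the model, tensor layout, and epsilon are assumptions for the example, not our actual setup):

```python
import torch
import torch.nn.functional as F

def adversarial_perturbation(model, image, label, epsilon=8 / 255):
    """One-step FGSM-style perturbation for a single (C, H, W) image tensor.

    The returned tensor is what would go into diff_like. Unlike Gaussian noise,
    it depends on `image` (through the gradient), which is exactly the source
    of the legal ambiguity discussed above.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()
    return epsilon * image.grad.sign()
```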

But we could do better: whatever we wish to put in diff_like, we could encrypt it using the corresponding original samples as the encryption key. In that last scenario, I personally believe this ought to be legally fine. But when the law is involved, what ought to be may differ from what is…
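To illustrate the idea (only a sketch: the SHA-256-based keystream with XOR is a stand-in for a properly reviewed encryption scheme):

```python
import hashlib

def keystream(original_sample_bytes: bytes, length: int) -> bytes:
    """Derive a pseudo-random byte stream from the corresponding original sample."""
    stream, counter = b"", 0
    while len(stream) < length:
        stream += hashlib.sha256(original_sample_bytes + counter.to_bytes(4, "big")).digest()
        counter += 1
    return stream[:length]

def encrypt_perturbation(perturbation_bytes: bytes, original_sample_bytes: bytes) -> bytes:
    """XOR the perturbation with a keystream derived from the original sample.

    Decryption is the same XOR, so get_benchmark can only recover the
    perturbation if the user already holds the corresponding original sample.
    """
    ks = keystream(original_sample_bytes, len(perturbation_bytes))
    return bytes(a ^ b for a, b in zip(perturbation_bytes, ks))
```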

My question

Are there known precedents and/or established ways to achieve what I describe above? Again, I’m not expecting any legal advice, but I welcome any kind of input.