Hi @lhoestq,
I know you are very busy, but I was wondering if you would be able to help me with some questions regarding the preprocessing scripts for the dataset used in CodeParrot.
In the CodeParrot research repository, there is an implementation of MinHash LSH for deduplicating datasets. The implementation uses a tuple, `code_key`, consisting of `base_index`, `repo_name`, and `path`, as a reference for retrieving information about the duplicate clusters. The clusters are formatted as a list of dicts:

```python
cluster = [{"base_index": el[0], "repo_name": el[1], "path": el[2]} for el in cluster]
```
In this example, the `transformersbook/codeparrot` dataset already contains the columns `repo_name` and `path`. How would you recommend handling the new case where the only column in the dataset is `text` (or `content` in the case of CodeParrot)?
In this case, with only a `text` column, is it suitable to define the list of dicts as:

```python
cluster = [{"base_index": el} for el in cluster]
```
Or do we need to add an additional column to use as a reference? For example, if we were using a dataset such as Enron Emails, which contains the columns `text` (string) and `meta`, would we define the list of dicts as:

```python
cluster = [{"base_index": el, "meta": el} for el in cluster]
```
In this new case with the Enron Emails dataset, the value `el` is an int and is not subscriptable, so accessing it as `el[0]` throws an error. The full `get_duplicate_clusters` function would then be:
```python
def get_duplicate_clusters(self) -> List[List[Dict]]:
    """Export the duplicate clusters.
    For each cluster, the first element is the base element of the cluster.
    The base element has an estimation jaccard similarity higher than the threshold with all the other elements.

    Returns:
        duplicate_clusters (List[List[Dict]]):
            List of duplicate clusters.
    """
    duplicate_clusters = []
    for base, duplicates in self._duplicate_clusters.items():
        cluster = [base] + list(duplicates)
        # reformat the cluster to be a list of dict
        cluster = [{"base_index": el, "meta": el} for el in cluster]
        duplicate_clusters.append(cluster)
    return duplicate_clusters
```
And the `_compute_min_hash` function as:
```python
def _compute_min_hash(element):
    index, data = element
    min_hash = get_min_hash([t for t in NON_ALPHA.split(data["text"]) if len(t.strip()) > 0])
    if min_hash is not None:
        # the key is now just the integer index instead of (index, repo_name, path)
        return index, min_hash
```
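Coming back to the `meta` question above, one alternative I can think of (just a rough sketch on my side, not something from the script) would be to keep the key as the plain integer index and attach the actual `meta` values afterwards by indexing back into the dataset, for example:

```python
# Rough sketch (my own idea, not part of the CodeParrot script):
# build the clusters with only the integer index as the key, then
# look the meta column up from the dataset when exporting the clusters.
duplicate_clusters = make_duplicate_clusters(dataset, jaccard_threshold)
for cluster in duplicate_clusters:
    for element in cluster:
        element["meta"] = dataset[element["base_index"]]["meta"]
```

I am not sure whether that is preferable to carrying `meta` through the MinHash key, though.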
When I run the `deduplicate_dataset` function, it seems to correctly remove the duplicates:
```python
def deduplicate_dataset(
    dataset: Type[Dataset], jaccard_threshold: float = 0.85
) -> Tuple[Type[Dataset], List[List[Dict]]]:
    """
    Example:
        >>> from datasets import load_dataset
        >>> from minhash_deduplication import deduplicate_dataset
        >>> ds = load_dataset("conceptofmind/pile_enron_emails", split="train")
        >>> ds_dedup, duplicate_clusters = deduplicate_dataset(ds, jaccard_threshold=0.95)
    """
    duplicate_clusters = make_duplicate_clusters(dataset, jaccard_threshold)
    duplicate_indices = set(x["base_index"] for cluster in duplicate_clusters for x in cluster)
    extreme_dict = {}
    extremes_clusters = find_extremes(duplicate_clusters, dataset, jaccard_threshold)
    for extremes in extremes_clusters:
        for element in extremes:
            extreme_dict[element["base_index"]] = element
    remove_indices = duplicate_indices - set(extreme_dict.keys())
    ds_filter = dataset.filter(lambda x, idx: idx not in remove_indices, with_indices=True)
    # update duplicate_clusters
    for cluster in duplicate_clusters:
        for element in cluster:
            element["is_extreme"] = element["base_index"] in extreme_dict
            if element["is_extreme"]:
                element["copies"] = extreme_dict[element["base_index"]]["copies"]
    return ds_filter, duplicate_clusters
```
I get the following results:

```
Original dataset size: 237585
Number of duplicate clusters: 12566
Files in duplicate cluster: 29247
Unique files in duplicate cluster: 14791
Filtered dataset size: 223129
Time to deduplicate dataset: 104.63
Size of deduplicate dataset: 223129
```
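In case it is useful, this is the kind of plain (non-MinHash) Jaccard check I have in mind for spot-checking pairs that end up in the same cluster; `jaccard_similarity` here is my own helper, not something taken from the script:

```python
import re

NON_ALPHA = re.compile("[^A-Za-z_0-9]")


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Exact Jaccard similarity over token sets, for spot-checking flagged pairs."""
    tokens_a = {t for t in NON_ALPHA.split(text_a) if t.strip()}
    tokens_b = {t for t in NON_ALPHA.split(text_b) if t.strip()}
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# e.g. compare two documents that were grouped into the same duplicate cluster
# print(jaccard_similarity(ds[i]["text"], ds[j]["text"]))
```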
Is there anything I could possibly be missing? Do I need to refactor any other parts of the script to handle this more general use case with Enron Emails, beyond what I have listed above?
I think it would be very beneficial to the community to have an accessible deduplication script using MinHash LSH, or a reference/blog post on how to implement it with different datasets.
I greatly appreciate your time and consideration.
Thank you,
Enrico