Joining datasets by column & best practices for multi-view datasets

Hey :wave:

I would like to contribute a dataset to HF Datasets but am unsure about the best practices on how to handle datasets that have more than one view for the data.

The data: My dataset consists of two tables. The first table contains the items, each with an ID and a text. The second table contains relations between the items. Each row of the second table looks like (id_a, id_b, type), where id_a and id_b are IDs from the first table and type further specifies the relation.

Current processing: For my local experiments, I load both tables into memory with pandas, join them so that I get a larger table with the columns (id_a, text_a, id_b, text_b, type), and save that as a third CSV file. I can then read either the table with the single items or the merged table. I do not use the relations table directly beyond that.

I have two questions here:

  1. I read in another topic that it is recommended to join the data first and then read the joined data with the datasets loaders. Is there any way to implement the pandas join with Hugging Face Datasets?
  2. How would I publish this as an HF dataset such that potential users can use either the items from the first table or the paired items from the merged table?

Not all items from the first table have a relation. If I only publish the merged table, I would thus lose a large portion of the raw dataset. Additionally, many items also have more than one relation. If I upload the pre-merged table, it would contain a lot of redundant texts.

Thank you very much in advance!

Hi! datasets currently doesn’t support SQL queries, but we plan to add support for that soon. Until then, it’s probably best to either upload the merged table alongside the original tables, or upload only the original tables and include code in the README that users can run to merge them.
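For the second option, the merge code in the README could look roughly like the sketch below. It uses plain pandas and reproduces the (id_a, text_a, id_b, text_b, type) layout described in the question. The column names `id` and `text` for the items table and the tiny in-memory tables are assumptions for illustration; in practice the two tables would be read from the published files.

```python
import pandas as pd

# Toy stand-ins for the two published tables (assumed column names).
items = pd.DataFrame(
    {"id": [1, 2, 3, 4], "text": ["apple", "banana", "cherry", "date"]}
)
relations = pd.DataFrame(
    {"id_a": [1, 1], "id_b": [2, 3], "type": ["similar", "opposite"]}
)

# Join the items table onto the relations table twice:
# once for the left-hand item, once for the right-hand item.
pairs = (
    relations
    .merge(items.rename(columns={"id": "id_a", "text": "text_a"}), on="id_a", how="left")
    .merge(items.rename(columns={"id": "id_b", "text": "text_b"}), on="id_b", how="left")
    [["id_a", "text_a", "id_b", "text_b", "type"]]
)

print(pairs)
```

Publishing only the two original tables keeps items without any relation (item 4 here) and avoids storing each text once per relation, while users who want the paired view can rebuild it with a few lines like these.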


How do I merge (join) two HF datasets on a common column using the map function?

Currently, I am doing this:

    def add_columns_train(example, index):
        # Copy in the columns of the row at the same position in each dataset
        example.update(train_dataset[index])
        example.update(labeled_train[index])
        return example

But this merges rows by their index. I want to merge on a common column, say ‘UID’.

You need to build a dict {column_value: index} so that you can look up the matching row for each example. Alternatively, do the merge by converting the datasets to a library that supports joins, such as Pandas or Polars (using to_pandas() or to_polars()).