[Open-to-the-community] One week team-effort to reach v2.0 of HF datasets library

Amazing!! I would love to contribute too.

Count me in @thomwolf :slight_smile: I work with South African languages, but let me know how I can help.

This is Amazing! I also want to contribute.

I would like to contribute! :hugs:
I am interested in medical and Spanish datasets

I would like to add to the dataset list.
This is a set of 271k Dutch tweets related to Covid-19. I will be adding a sentiment score and a subjectivity score using the pattern library. It can then be used as a benchmark for better transformer models.

Please count me! I’m in!

Hello Thomas, I would like to contribute for a low-resource language (Nepali).

Count me in!
Best,
Charlie

I want to join in. I'm a little late to the party, but I have a slightly different idea: adding updated versions of a few known datasets.

This is great! I would love to contribute as well.

@thomwolf great program! I would like to participate and will definitely try to add datasets related to regional languages. Please add me to Slack.

OK, I added all the new participants. Tell me if you didn’t receive the invitation to Slack!

Open-sourcely yours :slight_smile:

I would love to contribute to this project.

Hey @thomwolf, I would like to join.

It could be interesting to include chemistry-related data sets because molecules and chemical reactions can be represented as text using SMILES or SELFIES.

The tricky part with molecule data sets is that SMILES/SELFIES are not unique (multiple SMILES exist, representing the same molecule). Canonical representations exist, but every cheminformatics toolkit outputs a different canonical version.

There are multiple collections of data sets and tasks, like Moleculenet.ai/DeepChem or TDC.

Maybe there is a good way to interface HF data sets with, for example, MoleculeNet or TDC. It would be great to also include @seyonec in this conversation if he is interested. He probably has some good input.

There are also numerous chemical reaction data sets (reactants>reagents>products or precursors>>products, with “.” as a separator between molecules). The challenge here is that most of them were derived from the patent text mining work of Daniel Lowe, each with slightly different preprocessing: Chemical reactions from US patents (1976-Sep2016).
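
To make the reaction format above concrete, here is a minimal sketch in plain Python of how such a string could be split into its three parts. The names `Reaction` and `parse_reaction_smiles` are made up for illustration and are not part of HF datasets or any cheminformatics toolkit, and the esterification strings are just hand-written examples. It only splits text; it does not validate or canonicalize the SMILES, and it treats every “.” as a molecule boundary, so a salt written as several fragments would be split apart.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reaction:
    """Hypothetical container for the three parts of a reaction string."""
    reactants: List[str]
    reagents: List[str]
    products: List[str]


def split_molecules(part: str) -> List[str]:
    # Molecules within one part are separated by "."; drop empty pieces.
    return [mol for mol in part.split(".") if mol]


def parse_reaction_smiles(rxn: str) -> Reaction:
    # Both "reactants>reagents>products" and "precursors>>products"
    # contain exactly two ">" characters, so both yield three parts.
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got: {rxn!r}")
    reactants, reagents, products = (split_molecules(p) for p in parts)
    return Reaction(reactants, reagents, products)


# Ethanol + acetic acid, with sulfuric acid written as a reagent.
with_reagents = parse_reaction_smiles("CCO.CC(=O)O>OS(=O)(=O)O>CC(=O)OCC.O")
# Same reaction in the "precursors>>products" form (empty reagent slot).
no_reagents = parse_reaction_smiles("CCO.CC(=O)O>>CC(=O)OCC.O")

print(with_reagents)
print(no_reagents)
```

A loader built along these lines could expose reactants, reagents and products as separate columns, which would make it easier to compare data sets that were preprocessed differently.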

Hi @thomwolf, I think this project is amazing! I would love to join and work on Spanish datasets.

@thomwolf Me too, please.

I’m really interested in contributing! Email - pranavteegavarapu5@gmail.com

Late to the party, but I would love to contribute. Please let me know which datasets I could help with. I mainly work on NLP, but I'm open to picking up anything the team would like. @thomwolf please let me know… thanks.

I really want to participate!!

I would love to contribute!
