[Open-to-the-community] One week team-effort to reach v2.0 of HF datasets library

10-zin · December 3, 2020, 4:41pm

Amazing!! I would love to contribute too.

bduvenhage · December 3, 2020, 4:41pm

Count me in, @thomwolf. I work with South African languages, but let me know how I can help.

chaitanyabasava · December 3, 2020, 4:53pm

This is Amazing! I also want to contribute.

edugp · December 3, 2020, 4:57pm

I would like to contribute!
I am interested in medical and Spanish datasets

skylord · December 3, 2020, 5:33pm

I would like to add to the dataset list.
This is a set of 271k Dutch tweets related to Covid-19. I will be adding sentiment and subjectivity scores using the pattern library. It can then be used as a benchmark for better transformer models.

kleptsov · December 3, 2020, 5:48pm

Please count me! I’m in!

biswashbhusal2046 · December 3, 2020, 5:49pm

Hello Thomas, I would like to contribute for a low-resource language (Nepali).

ccraine · December 3, 2020, 6:48pm

Count me in!
Best,
Charlie

Nilansh · December 3, 2020, 7:27pm

I want to join in. I'm a little late to the party, but I have a slightly different idea: adding updated versions of a few known datasets.

Rishis · December 3, 2020, 8:03pm

This is great! I would love to contribute as well.

imrrahul · December 3, 2020, 8:21pm

@thomwolf great program! I would like to participate and will definitely try to add datasets related to regional languages. Please add me to Slack.

thomwolf · December 3, 2020, 8:34pm

Ok, added all the new participants, tell me if you didn’t receive the invitation to Slack!

Open-sourcely yours

sadia-afrin-purba · December 3, 2020, 9:43pm

I would love to contribute to this project.

pschwllr · December 3, 2020, 10:15pm

Hey @thomwolf, I would like to join.

It could be interesting to include chemistry-related data sets because molecules and chemical reactions can be represented as text using SMILES or SELFIES.

The tricky part with molecule data sets is that SMILES/SELFIES are not unique (multiple SMILES exist, representing the same molecule). Canonical representations exist, but every cheminformatics toolkit outputs a different canonical version.

There are multiple collections of data sets and tasks, like Moleculenet.ai/DeepChem or TDC.

Maybe there is a good way to interface HF data sets with, for example, MoleculeNet or TDC. It would be great to also include @seyonec in this conversation if he is interested. He probably has some good input.

There are also numerous chemical reaction data sets (reactants>reagents>products or precursors>>products, with “.” as a separation between molecules). The challenge here is that most of them were derived from the patent text mining work of Daniel Lowe with a slightly different preprocessing: Chemical reactions from US patents (1976-Sep2016).
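
To make the reaction format above concrete, here is a minimal sketch in plain Python that splits a reaction string into its three parts. The names `Reaction`, `split_molecules` and `parse_reaction_smiles` are made up for illustration and are not part of any library, and the esterification string is just an example I wrote, not a record from the patent data set. It assumes a plain `reactants>reagents>products` string with no extensions appended, and it does no chemistry at all (no validation, no canonicalization).

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    """Hypothetical container for the three sides of a reaction string."""
    reactants: List[str]
    reagents: List[str]
    products: List[str]


def split_molecules(part: str) -> List[str]:
    """Split one side of a reaction on '.', dropping empty fragments."""
    return [mol for mol in part.split(".") if mol]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Parse 'reactants>reagents>products' (or 'precursors>>products').

    Assumption: the string contains exactly two '>' characters and
    nothing else after the products.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(
            f"expected exactly two '>' separators, got {len(parts) - 1}"
        )
    reactants, reagents, products = (split_molecules(p) for p in parts)
    return Reaction(reactants, reagents, products)


# Illustrative input: acetic acid + ethanol, acid catalyst, ethyl acetate + water.
with_reagents = parse_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
# Same reaction in the 'precursors>>products' form (empty reagent slot).
without_reagents = parse_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
```

Note that "." also separates the ions of a salt (for example `[Na+].[Cl-]`), so the fragments returned here are not guaranteed to be whole compounds. A real loader would probably hand each fragment to a cheminformatics toolkit for validation and canonicalization.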

mariagrandury · December 3, 2020, 11:02pm

Hi @thomwolf, I think this project is amazing! I would love to join and work on Spanish datasets.

Normal-Thomas · December 4, 2020, 12:01am

@thomwolf Me too, please.

pranavnt · December 4, 2020, 12:51am

I’m really interested in contributing! Email - pranavteegavarapu5@gmail.com

bharati · December 4, 2020, 4:28am

Late to the party. Would love to contribute. Please let me know which datasets I could help with. I mainly work on NLP, but I'm open to picking up anything the team would like. @thomwolf please let me know… thanks.

prvnkmr · December 4, 2020, 5:52am

I really want to participate!!

somnath · December 4, 2020, 6:16am

I would love to contribute!
