Amazing!! I would love to contribute too.
Count me in, @thomwolf! I work with South African languages, so let me know how I can help.
This is amazing! I also want to contribute.
I would like to contribute!
I am interested in medical and Spanish datasets.
I would like to add to the dataset list.
This is a set of 271k Dutch tweets related to Covid19. I will be adding a sentiment score and a subjectivity score using the pattern library. It can then be used as a benchmark for better transformer algos.
Please count me! I'm in!
Hello Thomas, I would like to contribute for a low-resource language (Nepali).
Count me in!
Best,
Charlie
I want to join in. I'm a little late to the party, but I have a slightly different idea: adding updated versions of a few known datasets.
This is great! I would love to contribute as well.
@thomwolf great program! I would like to participate and will definitely try to add datasets related to regional languages… please add me to Slack.
Ok, added all the new participants, tell me if you didn't receive the invitation to Slack!
Open-sourcely yours
I would love to contribute to this project.
Hey @thomwolf, I would like to join.
It could be interesting to include chemistry-related data sets because molecules and chemical reactions can be represented as text using SMILES or SELFIES.
The tricky part with molecule data sets is that SMILES/SELFIES are not unique (multiple SMILES strings can represent the same molecule). Canonical representations exist, but every cheminformatics toolkit outputs a different canonical version.
There are multiple collections of data sets and tasks, like Moleculenet.ai/DeepChem or TDC.
Maybe there is a good way to interface HF data sets and, for example, MoleculeNet or TDC. It would be great to also include @seyonec in this conversation if he is interested. He probably has some good inputs.
There are also numerous chemical reaction data sets (reactants>reagents>products or precursors>>products, with "." as the separator between molecules). The challenge here is that most of them were derived from the patent text mining work of Daniel Lowe, each with slightly different preprocessing: Chemical reactions from US patents (1976-Sep2016).
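To make the reaction format above concrete, here is a minimal sketch in plain Python that splits a reaction string into its three parts. The function name `split_reaction_smiles` and the example reaction are my own illustrations, not taken from any of the data sets mentioned, and the sketch assumes plain reaction SMILES with no extra annotations after the string. It does no chemistry validation or canonicalization.

```python
def split_reaction_smiles(rxn: str) -> dict[str, list[str]]:
    """Split 'reactants>reagents>products' into lists of molecule SMILES.

    Hypothetical helper for illustration only: it checks the overall
    shape of the string, not whether each SMILES is chemically valid.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got: {rxn!r}")
    roles = ("reactants", "reagents", "products")
    # "." separates molecules within each role; an empty role
    # (as in precursors>>products) becomes an empty list.
    return {role: [m for m in part.split(".") if m]
            for role, part in zip(roles, parts)}


# Illustrative example: acid-catalysed esterification of acetic acid with ethanol.
with_reagent = split_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
without_reagent = split_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
print(with_reagent)
print(without_reagent)
```

Because different preprocessing pipelines put the same molecule in different roles or write it with a different SMILES string, a split like this is only a first step before comparing data sets.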
Hi @thomwolf, I think this project is amazing! I would love to join and work on Spanish datasets.
@thomwolf Me too, please.
I'm really interested in contributing! Email - pranavteegavarapu5@gmail.com
Late to the party. Would love to contribute. Please let me know which datasets I could help with. I mainly work on NLP, but I'm open to picking up anything the team would like. @thomwolf please let me know… thanks.
I really want to participate!!
I would love to contribute!