[Open-to-the-community] One week team-effort to reach v2.0 of HF datasets library

Divnoorsingh675 · November 26, 2020, 10:23pm

Yeah, I'm interested too 🤩

Elron · November 26, 2020, 10:23pm

Hi, I would love to join the effort. We have a few Hebrew datasets that will be very useful for Hebrew users: https://github.com/NLPH/NLPH_Resources

Luis · November 26, 2020, 11:32pm

This is great! I would love to contribute to this work.

stefan-it · November 27, 2020, 12:04am

I think I can integrate the (or a) Finnish NER dataset: “A Finnish News Corpus for Named Entity Recognition”.

Repository can be found under:

with all necessary data

AmitMY · November 27, 2020, 7:15am

I’m happy to see many people who would like to participate and create new datasets.

However, before that, and without wanting to be a downer: I have been adding datasets for a couple of months now, and there are some serious drawbacks that I believe need to be resolved before hf/datasets can lend itself to more complex, not text-only datasets. I don’t think newcomers should have to face them.

Issues I think are problematic:

1. Slow import - issue 885 - When developing a dataset and tweaking the code to check that something works, just importing “datasets” takes so long that it’s discouraging to keep developing.
2. Supporting sequences of arrays/dynamic arrays - issue 887 - is not yet implemented. This makes it hard to store images, videos, pose estimations, point clouds, etc., and we have to resort to ugly hacks.
3. No support for multiprocessing - issue 786 - I think it’s crucial for large datasets that you support multiprocessing. I’m working with very complex datasets where processing each sample takes some time, and I can’t wait 24 hours to load the entire dataset to make sure it’s good, nor can I expect users to do the same.
4. Dataset write batch size is not currently working - issue 741 - making it impossible to save datasets that don’t completely fit into memory (even with 500GB of memory). I think this automatically rules out any video datasets.
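
For anyone who wants to put a number on the import-time complaint (issue 885), here is a minimal sketch using only the Python standard library. The helper names `time_cold_import` and `interpreter_baseline` are made up for illustration and are not part of the datasets library. Each measurement starts a fresh interpreter so that already-cached modules do not hide the cost.

```python
import subprocess
import sys
import time


def time_cold_import(module_name: str) -> float:
    """Seconds a fresh interpreter needs to start up and import module_name."""
    start = time.perf_counter()
    subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        check=True,  # raise CalledProcessError if the import fails
    )
    return time.perf_counter() - start


def interpreter_baseline() -> float:
    """Seconds a fresh interpreter needs to start up and do nothing."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start
```

Calling `time_cold_import("datasets") - interpreter_baseline()` gives a rough estimate of the import overhead on your machine, assuming the library is installed. Timings vary between runs, so it is worth repeating the measurement a few times.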

I say this with all the love, I really like this repo, and I think it can do better to accommodate varied types/sizes of datasets.

vinaykudari · November 27, 2020, 8:03am

Keen to contribute, count me in

michalkichal · November 27, 2020, 10:03am

I can help out with Polish datasets. Here’s a list we can use: https://github.com/ksopyla/awesome-nlp-polish

SkanderHR · November 27, 2020, 10:09am

I would love to contribute to this!

danurahul · November 27, 2020, 10:09am

love to participate

dhruvrnaik · November 27, 2020, 10:47am

Would love to help, especially with Indic languages!

lhoestq · November 27, 2020, 11:22am

Hi @AmitMY, thanks for your feedback. We value community feedback a lot and we are constantly working on improving on what we already have.
Sorry that we’ve not been able to add your video datasets yet. Sign language datasets are valuable and we’re looking forward to being able to share them with the community through datasets.
That’s why we’ll work hard on the points you mentioned: 2 and 4 for video datasets, and 1 and 3 to speed things up even more!
I would also like to mention that improving import time is already in our pipeline!
Also, this sprint is for adding text datasets only, and we hope help from the community will be very valuable in improving the datasets library’s coverage of different languages and different types of text/NLP problems.
Feel free to ping me or the team on GitHub if you find issues or if you have ideas for things to improve.

adam-desormiere · November 27, 2020, 12:06pm

Hello @thomwolf & @lhoestq ,

I’ve been scraping a large part of the major French review websites over the past months, so I now have millions of raw [review + score] pairs, ideally suited for sentiment analysis.

I took as a model your own Allocine dataset, but mine is ~30 times bigger.

If you think it might be useful, I would be glad to contribute!

Cheers,
Adam

HimanshiSingh004 · November 27, 2020, 12:35pm

Would love to contribute!

dhilgart · November 27, 2020, 2:25pm

Count me in!

imtk · November 27, 2020, 5:29pm

Thai Literature Corpora (TLC) https://attapol.github.io/tlc.html
Asian Language Treebank (ALT) Project https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/

ashmeet · November 27, 2020, 9:04pm

Hi! Would love to be a part of this and contribute.

lukawskikacper · November 27, 2020, 11:02pm

Hi! I’m also eager to help! Do you need any support with Polish datasets?

arkhalid · November 28, 2020, 3:21am

I would love to participate. Can help with Urdu.
One of the corpora I’ve seen but not tried:

hariharanrl · November 28, 2020, 4:21am

Interested in working mainly on South Indian languages

raunaqjain · November 28, 2020, 7:37am

I would love to be a part of this!
