[Open-to-the-community] One week team-effort to reach v2.0 of HF datasets library

Yeah, I’m interested too 🤩

1 Like

Hi, I would love to join the effort. We have a few Hebrew datasets that will be very useful for Hebrew users: https://github.com/NLPH/NLPH_Resources

2 Likes

This is great! I would love to contribute to this work.

1 Like

I think I can integrate the (or a) Finnish NER dataset: “A Finnish News Corpus for Named Entity Recognition”.

Repository can be found under:

with all necessary data :hugs:

2 Likes

I’m happy to see many people who would like to participate and create new datasets.

However, before that, and I do not want to be a downer, I have been adding datasets for a couple of months now, and there are some serious drawbacks that I believe need to be resolved for hf/datasets to lend itself to more complex, not text-only datasets. I don’t think newcomers should have to face them.

Issues I think are problematic:

  1. Issue 885 - when developing a dataset and tweaking the code to check whether something was done correctly, just importing “datasets” takes so long that it’s discouraging to keep developing.
  2. Supporting sequences of arrays/dynamic arrays - issue 887 - is not yet implemented. This makes it a problem to store images, videos, pose estimations, point clouds, etc., and we have to resort to ugly hacks.
  3. No support for multiprocessing - issue 786 - I think it’s crucial for large datasets that you support multiprocessing. I’m working with very complex datasets where processing each sample takes some time, and I can’t wait 24 hours to load the entire dataset to make sure it’s good, nor can I expect users to do the same.
  4. Dataset write batch size is not currently working - issue 741 - making it impossible to save datasets that don’t completely fit into memory (even using 500GB of memory). I think this automatically rules out any video datasets.
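To put a number on point 1, here is a minimal sketch of how the import cost could be measured. `time_import` is a hypothetical helper written for this post, not part of the library, and the standard-library module `json` is used as a stand-in so the snippet runs even without `datasets` installed.

```python
import importlib
import sys
import time


def time_import(module_name: str) -> float:
    """Return the wall-clock seconds spent importing `module_name`.

    The top-level module is dropped from the import cache first, so its
    own code runs again. Submodules and dependencies that are already
    loaded stay cached, so a fresh interpreter gives the most honest number.
    """
    sys.modules.pop(module_name, None)
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Assumption: replace "json" with "datasets" to time the library itself.
    elapsed = time_import("json")
    print(f"import took {elapsed:.4f} s")
```

For a per-module breakdown from a clean interpreter, CPython 3.7+ also accepts `python -X importtime -c "import datasets"`, which prints the cumulative import time of every module loaded.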

I say this with all the love, I really like this repo, and I think it can do better to accommodate varied types/sizes of datasets.

1 Like

Keen to contribute, count me in :slight_smile:

1 Like

I can help out with Polish datasets. Here’s a list we can use: https://github.com/ksopyla/awesome-nlp-polish

1 Like

I would love to contribute to this!

1 Like

:star_struck: Would love to participate

1 Like

Would love to help, especially with Indic languages!

1 Like

Hi @AmitMY, thanks for your feedback. We value community feedback a lot and we are constantly working on improving on what we already have.
Sorry that we haven’t been able to add your video datasets yet. Sign language datasets are valuable and we’re looking forward to being able to share them with the community through datasets.
That’s why we’ll work hard on the points you mentioned: 2 and 4 for video datasets, and 1 and 3 to speed things up even more!
I would also like to mention that improving import time is already in our pipeline!
Also, this sprint is for adding text datasets only, and we hope help from the community will be very valuable in improving the coverage provided by the datasets library across different languages and different types of text/NLP problems.
Feel free to ping me or the team on GitHub if you find issues or have ideas for things to improve.

1 Like

Hello @thomwolf & @lhoestq ,

I’ve been scraping a large part of the major French review websites over the past months, so I now have millions of raw [review + score] pairs, ideally suited for sentiment analysis.

I used your own Allocine dataset as a model, but mine is ~30 times bigger.

If you think it might be useful, I would be glad to contribute!

Cheers,
Adam

1 Like

Would love to contribute!

1 Like

Count me in!

1 Like

Hi! Would love to be a part of this and contribute.

1 Like

Hi! I’m also eager to help! Do you need any support with Polish datasets?

1 Like

I would love to participate. I can help with Urdu.
One of the corpora I’ve seen but not tried:

1 Like

Interested in working mainly on South Indian languages.

1 Like

I would love to be a part of this!

1 Like