Quick question: what is the vision for the nlp library? Will its main focus be curating existing datasets, or might it evolve into a more general-purpose PyArrow wrapper for any (text?) dataset? I'm blown away by its speed, and it would be amazing to be able to do the same with my own text datasets.
I know I could already just start using PyArrow directly (as sketched below), but I have a feeling that the nlp library might have more text-specific functionality coming down the line that it would be amazing to be able to use with my own data…
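Roughly what I mean by "using PyArrow directly" is something like the following sketch; the column and file names are just illustrative, not taken from any particular project:

```python
import pyarrow as pa

# Build an Arrow table from an in-memory list of texts
# (column and file names here are purely illustrative).
texts = ["first document", "second document", "third document"]
table = pa.table({"text": pa.array(texts, type=pa.string())})

# Write the table to disk in the Arrow IPC file format...
with pa.OSFile("my_texts.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# ...and memory-map it back, so records can be read
# without loading the whole file into RAM.
with pa.memory_map("my_texts.arrow") as source:
    loaded = pa.ipc.open_file(source).read_all()

print(loaded.column("text")[0])
```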
You can already load a dataset from your own local CSV files with the csv loading script, as sketched below (replace csv with json, and soon with pandas as well, for loading from JSON and pandas files), and we plan to add more options for loading datasets from your own data, both from external files and from data that is already loaded in memory in your Python session.
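For the csv case, a call along these lines should work (the file paths are placeholders for your own data, and the split mapping is just one way to pass `data_files`):

```python
from nlp import load_dataset

# Load local CSV files with the generic csv loading script.
# "my_train.csv" / "my_test.csv" are placeholders for your own files.
dataset = load_dataset(
    "csv",
    data_files={"train": "my_train.csv", "test": "my_test.csv"},
)

print(dataset["train"][0])
```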
For data that is already in memory, like a python dict or a pandas dataframe, you can have a look at the PR on this topic here: https://github.com/huggingface/nlp/pull/350, which should be merged soon.
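If that PR lands in its current shape, in-memory loading could look roughly like this; the `Dataset.from_dict` / `Dataset.from_pandas` constructor names are an assumption based on the PR, not a released API:

```python
import pandas as pd
import nlp

# Hypothetical usage once the PR above is merged; the constructor names
# are assumptions, not a finalized API.
data = {"text": ["first example", "second example"], "label": [0, 1]}

dset_from_dict = nlp.Dataset.from_dict(data)

df = pd.DataFrame(data)
dset_from_pandas = nlp.Dataset.from_pandas(df)

print(dset_from_dict[0])
print(dset_from_pandas[0])
```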
Overall, we want to add more documentation and examples of use cases very soon.
Other exciting topics coming soon for the library are:
- simple and efficient ways to index, encode, and query dataset records