Quick question: what is the vision for the nlp library? Will its main focus be curating existing datasets, or might it evolve into a more general-purpose PyArrow wrapper for any (text?) dataset? I'm blown away by its speed, and it would be amazing to be able to do the same with my own text datasets.
I know I could already just start using PyArrow directly (as sketched below), but I have a feeling that the nlp library might have more text-specific functionality coming down the line that it would be amazing to be able to use with my own data…
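Roughly what I mean by "using PyArrow directly" is something like the following sketch; the column and file names are just illustrative, not taken from any particular project:

```python
import pyarrow as pa

# Build an Arrow table from an in-memory list of texts
# (column and file names here are purely illustrative).
texts = ["first document", "second document", "third document"]
table = pa.table({"text": pa.array(texts, type=pa.string())})

# Write the table to disk in the Arrow IPC file format...
with pa.OSFile("my_texts.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# ...and memory-map it back, so records can be read
# without loading the whole file into RAM.
with pa.memory_map("my_texts.arrow") as source:
    loaded = pa.ipc.open_file(source).read_all()

print(loaded.column("text")[0])
```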
You can already load a dataset from your own local CSV files with the csv loading script, as sketched below (replace csv with json, and soon with pandas as well, for loading from JSON and pandas files), and we plan to add more options for loading datasets from your own data, both from external files and from data that is already loaded in memory in your Python session.
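For the csv case, a call along these lines should work (the file paths are placeholders for your own data, and the split mapping is just one way to pass `data_files`):

```python
from nlp import load_dataset

# Load local CSV files with the generic csv loading script.
# "my_train.csv" / "my_test.csv" are placeholders for your own files.
dataset = load_dataset(
    "csv",
    data_files={"train": "my_train.csv", "test": "my_test.csv"},
)

print(dataset["train"][0])
```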
For data that is already in memory, like a python dict or a pandas dataframe, you can have a look at the PR on this topic here: https://github.com/huggingface/nlp/pull/350, which should be merged soon.
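If that PR lands in its current shape, in-memory loading could look roughly like this; the `Dataset.from_dict` / `Dataset.from_pandas` constructor names are an assumption based on the PR, not a released API:

```python
import pandas as pd
import nlp

# Hypothetical usage once the PR above is merged; the constructor names
# are assumptions, not a finalized API.
data = {"text": ["first example", "second example"], "label": [0, 1]}

dset_from_dict = nlp.Dataset.from_dict(data)

df = pd.DataFrame(data)
dset_from_pandas = nlp.Dataset.from_pandas(df)

print(dset_from_dict[0])
print(dset_from_pandas[0])
```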
Overall, we want to add more documentation and examples of use cases very soon.
Other exciting topics coming soon for the library are:
- simple and efficient ways to index, encode, and query dataset records