nlp 0.3.0 is out!


New methods to transform a dataset:

  • dataset.shuffle: create a shuffled dataset
  • dataset.train_test_split: create a train and a test split (similar to sklearn)
  • dataset.sort: create a dataset sorted according to a certain column
  • dataset.select: create a dataset with the rows selected by a given list of indices
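Conceptually, these transforms behave like the following operations on a plain Python list of rows. This is only a sketch of the semantics, not the library's implementation (the real methods operate on Arrow-backed Dataset objects):

```python
import random

# A toy "dataset": a list of row dicts
rows = [{"idx": i, "label": i % 2} for i in range(10)]

# shuffle: reorder the rows with a reproducible seed
shuffled = list(rows)
random.Random(42).shuffle(shuffled)

# train_test_split: split the rows into a train and a test portion
split_point = int(len(rows) * 0.8)
train, test = rows[:split_point], rows[split_point:]

# sort: order the rows by the values of one column
by_label = sorted(rows, key=lambda row: row["label"])

# select: keep only the rows at the given indices, in the given order
selected = [rows[i] for i in [0, 5, 2]]

print(len(train), len(test))         # 8 2
print([r["idx"] for r in selected])  # [0, 5, 2]
```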

Other features:

  • Better instructions for datasets that require manual download

    Important: if you load datasets that require manual downloads with an older version of nlp, instructions won’t be shown and an error will be raised

  • Better access to dataset information (for instance dataset.features['label'] or dataset.dataset_size)


  • New: cos_e v1.0
  • New: rotten_tomatoes
  • New: German and Italian Wikipedia

New docs:

  • documentation about splitting a dataset

Bug fixes:

  • fix metric.compute, which couldn’t write to file
  • fix squad_v2 imports

Nice, enjoying using nlp already!

Quick question: what is the vision for the nlp library? Will its main focus be curating existing datasets, or might it evolve into a more general-purpose PyArrow wrapper for any (text?) dataset? I’m just blown away by its speed, and it would be amazing to be able to do the same with my own text datasets.

I know I could already just start using PyArrow directly (as below), but I have a feeling the nlp library might have more text-specific functionality coming down the line that it would be amazing to use with my own data…

table = pa.Table.from_pandas(df)

Yes definitely, it’s already possible to load your own CSV or JSON files like this:

from nlp import load_dataset

dataset = load_dataset('csv', data_files='my_file.csv')
dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 
                                          'test': 'my_test_file.csv'})
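To illustrate the three shapes of the data_files argument shown above (a single path, a list of paths, or a dict mapping split names to one or more paths), here is a sketch of the normalization a loader might perform. normalize_data_files is a hypothetical helper for illustration, not the library’s actual code:

```python
def normalize_data_files(data_files):
    """Normalize data_files into a dict of {split_name: [file, ...]}.

    Accepts a single path (str), a list of paths, or a dict mapping
    split names to a path or list of paths. Hypothetical sketch.
    """
    if isinstance(data_files, str):
        data_files = {"train": [data_files]}
    elif isinstance(data_files, list):
        data_files = {"train": list(data_files)}
    # Wrap any bare string values in a list so every split maps to a list
    return {
        split: [files] if isinstance(files, str) else list(files)
        for split, files in data_files.items()
    }

print(normalize_data_files("my_file.csv"))
# {'train': ['my_file.csv']}
print(normalize_data_files({"train": ["a.csv", "b.csv"], "test": "c.csv"}))
# {'train': ['a.csv', 'b.csv'], 'test': ['c.csv']}
```

All three call forms in the snippet above collapse to the same dict-of-lists shape, which is why a single split ends up named "train" by default in this sketch.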

(replace 'csv' with 'json' to load JSON files; 'pandas' will be supported soon as well)
We also plan to add more options for loading datasets from your own data, both from external files and from data already loaded in memory in your Python session.

For data that is already in memory, like a Python dict or a pandas dataframe, you can have a look at the open PR on this, which should be merged soon.

Overall, we want to add more docs and use-case examples very soon.

Other exciting topics coming soon for the library are:

  • simple and efficient ways to index, encode and query dataset records
  • traceability and reproducibility features

:exploding_head: amazing, thanks for the work on this!