Nlp Datasets: speed-test vs Fastai

I was playing around with the nlp Datasets library and was seriously impressed by the speed!

I figured it would be interesting to test whether it makes more sense to do as much of the text processing (e.g. cleaning, tokenization, numericalisation) as possible with it, instead of using fastai’s defaults. I used fastai’s TextDataloader with all of its defaults and tried to replicate its functionality with nlp Datasets.

Full blog post here

Curious if anyone has feedback on how this test might have been done better, especially any pointers on how to parallelise tokenisation with nlp Datasets :slight_smile:

Just tell me the results

Results were…mixed…

Fastai’s initialisation (load, preprocess, tokenize etc.) was faster with the 1.6M-row Sentiment140 dataset I used; however, I have a few caveats:


Fastai parallelises the tokenization, which I couldn’t figure out how to do with nlp Datasets (probably my own lack of knowledge and not a limitation of the library though). My guess is that doing so would likely make nlp Datasets much faster than fastai.
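For reference, one manual way to parallelise would be to shard the texts across a process pool. This is a stdlib-only sketch with a toy whitespace tokenizer standing in for a real one (the helper names are mine, not the nlp API):

```python
from multiprocessing import Pool

def tokenize_batch(texts):
    # Toy whitespace "tokenizer" standing in for a real one
    # (e.g. a Rust-backed fast tokenizer).
    return [t.split() for t in texts]

def parallel_tokenize(texts, n_procs=4):
    # Shard the texts, tokenize each shard in its own process,
    # then re-interleave so the output order matches the input order.
    shards = [texts[i::n_procs] for i in range(n_procs)]
    with Pool(n_procs) as pool:
        results = pool.map(tokenize_batch, shards)
    out = [None] * len(texts)
    for shard_idx, toks in enumerate(results):
        for j, t in enumerate(toks):
            out[shard_idx + j * n_procs] = t
    return out

if __name__ == "__main__":
    print(parallel_tokenize(["one two", "three", "four five six"], n_procs=2))
```

Sharding with a stride (`texts[i::n_procs]`) keeps the shards roughly equal in size even when sample lengths vary along the dataset.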

Sorting by sample length

To try and replicate SortedDL's behaviour, I sorted the entire dataset in the nlp Datasets trial, which added a significant amount of time. There may well be a better way to replicate SortedDL's behaviour.
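For context, the idea is the same as SortedDL's: compute a length per sample, then order samples by it. Here is a stdlib-only sketch on a dict-of-columns (with nlp you'd add the length column via `.map` and then call `.sort("length")`; the helper names are mine):

```python
def add_lengths(batch):
    # batch is a dict of columns, like the batches .map(batched=True) yields
    batch["length"] = [len(t.split()) for t in batch["text"]]
    return batch

def sort_by_length(batch):
    # Argsort by the length column, then reorder every column the same way
    order = sorted(range(len(batch["length"])), key=lambda i: batch["length"][i])
    return {col: [vals[i] for i in order] for col, vals in batch.items()}

columns = {"text": ["one two three", "one", "one two"]}
columns = sort_by_length(add_lengths(columns))
# columns["text"] -> ["one", "one two", "one two three"]
```

Note that SortedDL only needs samples *approximately* ordered per batch, so a full global sort like this does strictly more work than fastai does.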


nlp Datasets also uses caching, so the second time you run the same pre-processing it is much, much faster

10% Data

| 0.16M rows | Init (s) | 1 epoch (s) | 1 mini-batch [bs=64] (ms) |
|---|---|---|---|
| Fastai | 124 | 14.3 | 7.4 |
| Fastai w/ sorted | 48.1 | 14.3 | 7.4 |
| nlp | 71.2 | 11.3 | 5.6 |

100% Data

| 1.6M rows | Init (s) | 1 epoch (s) |
|---|---|---|
| Fastai w/ sorted | 484 | 142 |
| nlp | 1024 | 323 |

Any and all feedback welcome!

(the forums auto-correct “nlp” in my post title to “Nlp” haha)


Hi there :slight_smile:
Thanks for doing this speed comparison! It's important for us to make sure we achieve the fastest read/write/process performance we can offer using the power of Apache Arrow, with minimal memory usage.
We plan to add multiprocessing in the very short term, which will speed up processing significantly :smiley:

Also, out of curiosity, did you try to process the dataset in memory with :hugs:nlp, just to get an idea of the speed difference? By default it uses memory mapping, which is really fast and uses almost no memory, but in-memory processing could be interesting for users who don’t really care about memory usage.
You can do that by specifying keep_in_memory=True in .sort() and .map().
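In case it helps, here's a hedged sketch of what that would look like (the length column and helper function are illustrative, and the nlp calls only run if the library is installed):

```python
import importlib.util

def add_length(batch):
    # Batched .map function: adds a per-sample length column
    return {"length": [len(t.split()) for t in batch["text"]]}

# Guard so the sketch is importable even without the nlp library installed
if importlib.util.find_spec("nlp") is not None:
    from nlp import load_dataset

    ds = load_dataset("sentiment140", split="train")
    # keep_in_memory=True works on RAM-backed tables instead of writing
    # memory-mapped Arrow cache files to disk
    ds = ds.map(add_length, batched=True, keep_in_memory=True)
    ds = ds.sort("length", keep_in_memory=True)
```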

Thanks, was a fun experiment

No I haven’t, but I’ll give it a try and report back

Very nice!

Does tokenization in the nlp package not use the fast Rust tokenizers?


AFAIK, nlp doesn’t provide tokenizers; you can use any tokenizer you want with it, including the fast (Rust) tokenizers.
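For example, a fast (Rust-backed) tokenizer from transformers can be plugged into `.map` like this. A hedged sketch that only runs if both libraries are installed; the model name, column name and batch size are illustrative:

```python
import importlib.util

def have(mod):
    # True if the module is importable in this environment
    return importlib.util.find_spec(mod) is not None

if have("nlp") and have("transformers"):
    from nlp import load_dataset
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

    def tokenize(batch):
        # Passing a whole batch of texts lets the fast (Rust) tokenizer
        # parallelise internally, even though .map runs in one process
        return tok(batch["text"], truncation=True)

    ds = load_dataset("sentiment140", split="train")
    ds = ds.map(tokenize, batched=True, batch_size=1000)
```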


We’re working on some speed ups to better support the tokenizers from transformers or from the tokenizers library.

The latest improvement (here) brought a ~10x speed-up when using a tokenizer, by removing unnecessary conversions when reading/writing from the Arrow format. It will be available in the next release, along with multiprocessing :slight_smile:

Our goal is to be as close as possible to the optimal conditions for tokenization.