Nlp Datasets: speed-test vs Fastai

lhoestq · July 23, 2020, 8:22am

Also, out of curiosity, did you try to process the dataset in memory with nlp, just to get an idea of the difference of speed ? By default it uses memory-mapping which is really fast and uses almost no memory, but it could be interesting for the users that don’t really care about memory usage.
You can do that by specifying keep_in_memory=True in .sort() and .map().

Topic		Replies	Views
Hugdatafast: hugginface/nlp + fastai 🤗Datasets	1	1512	September 8, 2020
Tokenizer dataset is very slow 🤗Tokenizers	3	4317	March 2, 2024
Speed issues using tokenizer.train_new_from_iterator on ~50GB dataset 🤗Transformers	7	2237	November 11, 2024
Difference between tokenizer and tokenizerfast Beginners	4	4233	December 22, 2023
When using Dataset.map to tokenize a dataset, the speed slows down as the progress approaches 100% 🤗Datasets	3	888	December 23, 2024

Nlp Datasets: speed-test vs Fastai

Related topics