Generating Vocabulary using Datasets

I have a script that I’d like to optimise. I’m generating plots for a publication and thought it was a good time to learn the datasets library, since my brute-force approach is slow. However, my datasets implementation is slower…

I have a basic loop that iterates through each (text) example in my corpus, tokenises it and then counts the tokens. I’m also computing some other metrics, i.e.

from collections import Counter

token_frequencies = Counter()
for example in examples:
    tokens = tokenise_function(example['text'])  # returns a list of tokens, e.g. ['hello', 'world']
    for token in tokens:
        token_frequencies[token] += 1
    # compute other metrics

This takes 16 minutes to run.

I can speed up the tokenisation 4-fold using datasets by creating a tokens column that holds an array of arrays of strings. But then performing any operation on that column takes an inordinate amount of time. The fastest approach I’ve come up with is to cast it to a python list (which takes about 1 minute to convert):

from collections import Counter

from datasets import load_dataset
from tqdm import tqdm

dataset = load_dataset("csv", data_files="data/corpus.txt", names=['text'], keep_in_memory=True)
dataset = tokenize_function(dataset)  # adds a 'tokens' column (array of arrays of strings)

token_frequencies = Counter()
for tokens in tqdm(list(dataset['train']['tokens']), total=len(dataset['train'])):
    for token in tokens:
        token_frequencies[token] += 1
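
(tokenize_function here wraps the dataset’s map; a minimal sketch of what it could look like, with a plain whitespace split standing in for the real tokeniser:)

def tokenize_function(dataset):
    # batched map: adds a 'tokens' column of string arrays to every split
    return dataset.map(
        lambda batch: {"tokens": [text.split() for text in batch["text"]]},
        batched=True,
    )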

Is this the best I can hope for? Iterating over each row in the dataset takes as long as my original code. I’m assuming I cannot use map?

You can use map to tokenize the dataset, and as a nice optimization you can set the dataset to output numpy arrays, which is much faster than outputting python lists. Indeed, a Dataset is a wrapper around an Arrow table, and Arrow data can be converted to numpy arrays for free, without copying:

dataset["train"].set_format("numpy")
tokens = dataset["train"]["tokens"]
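
From there, the frequency count could look roughly like this (reusing the tokens variable from above, where each row comes back as a numpy array of strings):

from collections import Counter

token_frequencies = Counter()
for row in tokens:
    # row is a numpy array of token strings rather than a python list
    token_frequencies.update(row)

token_frequencies.update(row) also replaces the explicit inner per-token loop.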