Generating Vocabulary using Datasets

I have a script that I’d like to optimise. I’m generating plots for a publication and thought it was a good time to learn the datasets library, since my brute-force approach is slow. However, my datasets implementation is slower…

I have a basic loop that iterates through each (text) example in my corpus, tokenises it and then counts the tokens. I’m also computing some other metrics, i.e.

from collections import Counter

token_frequencies = Counter()
for example in examples:
    tokens = tokenise_function(example['text'])  # returns a list of tokens, e.g. ['hello', 'world']
    for token in tokens:
        token_frequencies[token] += 1
    # compute other metrics

This takes 16 minutes to run.

I can speed up the tokenisation 4-fold using datasets by creating a tokens column that holds an array of arrays of strings. But then performing any operation on that column takes an inordinate amount of time. The fastest approach I’ve come up with is to cast it to a python list (which takes about 1 minute to convert):

from collections import Counter

from datasets import load_dataset
from tqdm import tqdm

dataset = load_dataset("csv", data_files="data/corpus.txt", names=['text'], keep_in_memory=True)
dataset = tokenize_function(dataset)  # adds a 'tokens' column (array of arrays of strings)

token_frequencies = Counter()
for tokens in tqdm(list(dataset['train']['tokens']), total=len(dataset['train'])):
    for token in tokens:
        token_frequencies[token] += 1
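
(tokenize_function here wraps the dataset’s map; a minimal sketch of what it could look like, with a plain whitespace split standing in for the real tokeniser:)

def tokenize_function(dataset):
    # batched map: adds a 'tokens' column of string arrays to every split
    return dataset.map(
        lambda batch: {"tokens": [text.split() for text in batch["text"]]},
        batched=True,
    )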

Is this the best I can hope for? Iterating over each row in the dataset takes as long as my original code. I’m assuming I cannot use map?

You can use map to tokenize the dataset, and as a nice optimization you can set the dataset to output numpy arrays, which is much faster than outputting python lists. Indeed, a Dataset is a wrapper around an Arrow table, and Arrow data can be converted to numpy arrays for free, without copying:

dataset["train"].set_format("numpy")
tokens = dataset["train"]["tokens"]
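
From there, the frequency count could look roughly like this (reusing the tokens variable from above, where each row comes back as a numpy array of strings):

from collections import Counter

token_frequencies = Counter()
for row in tokens:
    # row is a numpy array of token strings rather than a python list
    token_frequencies.update(row)

token_frequencies.update(row) also replaces the explicit inner per-token loop.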