I have a script that I’d like to optimise. I’m generating plots for a publication and thought it was a good time to learn the datasets library, since my brute-force approach is slow. However, my datasets implementation is even slower…
I have a basic loop that goes through each (text) example in my corpus, tokenises it, then counts the tokens. I’m also computing some other metrics, i.e.
from collections import Counter

token_frequencies = Counter()
for example in examples:
    tokens = tokenise_function(example['text'])  # returns a list of tokens, e.g. ['hello', 'world']
    for token in tokens:
        token_frequencies[token] += 1
    # compute other metrics
This takes 16 minutes to run.
I can speed up the tokenisation 4-fold using datasets by creating a tokens column (an array of arrays of strings). But then performing any operation on that column takes an inordinate amount of time. The fastest approach I’ve come up with is to cast it to a Python list first (about 1 minute to convert):
from collections import Counter
from datasets import load_dataset
from tqdm import tqdm

dataset = load_dataset("csv", data_files="data/corpus.txt", names=['text'], keep_in_memory=True)
dataset = tokenize_function(dataset)  # adds a 'tokens' column (list of tokens per row)

token_frequencies = Counter()
for tokens in tqdm(list(dataset['train']['tokens']), total=len(dataset['train'])):
    for token in tokens:
        token_frequencies[token] += 1
Is this the best I can hope for? Iterating over each row of the dataset directly takes as long as my original code. I’m assuming I cannot use map for this?
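To be concrete about what I mean by “use map”: the only thing I can think of is counting as a side effect inside a batched map, something like the sketch below (count_tokens is just illustrative), but that throws away map’s return value and I’m not sure it would be any faster:

token_frequencies = Counter()

def count_tokens(batch):
    # counting as a side effect; the dataset returned by map is unused
    for tokens in batch['tokens']:
        token_frequencies.update(tokens)

dataset['train'].map(count_tokens, batched=True)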