Finding the number of tokens in a dataset

Hello!

Is there a recommended/easy way to find the total number of tokens in a huggingface dataset?

I know it’s possible to use FreqDist in nltk, but it would be great if there were an implementation in datasets.

Thanks in advance.

Hi,

currently, there is no easy way to do that because Dataset.unique doesn’t have an option to flatten sequences, which is what tokenization produces (a sequence of tokens per example).
However, it shouldn’t be too hard to add support for that. Once it’s in, you would be able to solve your task like this:

from nltk import FreqDist
from nltk.tokenize import word_tokenize

def tokenize(batch):
    # replace each text with its list of tokens
    batch["text_col"] = [word_tokenize(text) for text in batch["text_col"]]
    return batch

dset_tokenized = dset.map(tokenize, batched=True)
unique_tokens = dset_tokenized.unique("text_col", flatten=True)  # <- new argument
fdist = FreqDist(unique_tokens)
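
In the meantime, if all you need is the total token count, a workaround along these lines should already work with the current API (a rough sketch, assuming the same word_tokenize tokenizer and a column named "text_col"):

from nltk.tokenize import word_tokenize

def count_tokens(batch):
    # add a column with the number of tokens per example
    return {"n_tokens": [len(word_tokenize(text)) for text in batch["text_col"]]}

dset_counted = dset.map(count_tokens, batched=True)
total_tokens = sum(dset_counted["n_tokens"])
print(total_tokens)

This only gives you the total count, not the per-token frequencies, so the flatten option would still be needed for the FreqDist part.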

(a link to the issue on GH where you can track progress)

Thanks for the answer!