Finding the number of tokens in a dataset

Hello!

Is there a recommended/easy way to find the total number of tokens in a huggingface dataset?

I know it’s possible to use FreqDist in nltk, but it would be great if there were an implementation in datasets.

Thanks in advance.

Hi,

currently, there is no easy way to do that because Dataset.unique doesn’t have an option to flatten sequences, which is what tokenization produces (a sequence of tokens per example).
However, it shouldn’t be too hard to add support for that. Once it’s in, you would be able to solve your task like this:

from nltk import FreqDist
from nltk.tokenize import word_tokenize

def tokenize(batch):
    # replace each text with its list of tokens
    batch["text_col"] = [word_tokenize(text) for text in batch["text_col"]]
    return batch

dset_tokenized = dset.map(tokenize, batched=True)
unique_tokens = dset_tokenized.unique("text_col", flatten=True)  # <- new argument
fdist = FreqDist(unique_tokens)
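
In the meantime, if all you need is the total token count, a workaround along these lines should already work with the current API (a rough sketch, assuming the same word_tokenize tokenizer and a column named "text_col"):

from nltk.tokenize import word_tokenize

def count_tokens(batch):
    # add a column with the number of tokens per example
    return {"n_tokens": [len(word_tokenize(text)) for text in batch["text_col"]]}

dset_counted = dset.map(count_tokens, batched=True)
total_tokens = sum(dset_counted["n_tokens"])
print(total_tokens)

This only gives you the total count, not the per-token frequencies, so the flatten option would still be needed for the FreqDist part.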

(a link to the issue on GH where you can track progress)

Thanks for the answer!