Tokenizer from scratch Error TypeError: Can't convert None to PyString

Hi,

I tried to train a tokenizer from scratch on a dataset I created myself.
The dataset has the following structure:

DatasetDict({
    train: Dataset({
        features: ['slug', 'date', 'type', 'content', 'word_count', '__index_level_0__'],
        num_rows: 3048752
    })
})

I then tried to train a ByteLevelBPETokenizer from scratch, but I always get the error “TypeError: Can’t convert None to PyString”.

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

raw_dataset = load_dataset('drsis/law_de_courtdecisions')

def get_training_corpus():
    # Yield the 'content' column in batches of 1000 rows
    dataset = raw_dataset['train']
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples['content']

# Initialize an empty tokenizer
tokenizer = ByteLevelBPETokenizer(add_prefix_space=True)

# Train the tokenizer on the corpus using the generator get_training_corpus as a source
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    show_progress=True,
)

Does anybody have an idea what the issue is? I went through the whole content column of the dataset to check for None values, but did not find any.

I would appreciate it if somebody could help me out.

Thanks

I found the error myself.
The dataset contains 171 empty elements, which caused the TypeError.
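
For anyone hitting the same error, here is a minimal sketch of one way to drop such entries before they reach the tokenizer. It assumes the empty elements are None values (or blank strings) in the content column, which the post does not spell out. The helper names count_empty, clean_batch and iter_clean_batches are hypothetical and not part of tokenizers or datasets.

```python
def count_empty(texts):
    """Count entries that are None or contain only whitespace."""
    return sum(1 for t in texts if not isinstance(t, str) or not t.strip())


def clean_batch(texts):
    """Keep only non-empty strings; drops None and blank entries."""
    return [t for t in texts if isinstance(t, str) and t.strip()]


def iter_clean_batches(batches):
    """Yield cleaned batches, skipping batches that end up empty."""
    for batch in batches:
        cleaned = clean_batch(batch)
        if cleaned:
            yield cleaned


# Small stand-in for batches of the 'content' column (made-up data)
example_batches = [["Urteil A", None], [None, "   "], ["Beschluss B"]]
print(count_empty(example_batches[0]))
print(list(iter_clean_batches(example_batches)))
```

In the script above, this would amount to yielding clean_batch(samples['content']) instead of samples['content'] inside get_training_corpus. Filtering the dataset once up front, for example with Dataset.filter on the content column, should work as well.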
