Tokenizer from scratch Error TypeError: Can't convert None to PyString

drsis · December 25, 2022, 8:31am

Hi,

I tried to train a tokenizer from scratch with my own created dataset.
The dataset has the following structure:

DatasetDict({
    train: Dataset({
        features: ['slug', 'date', 'type', 'content', 'word_count', '__index_level_0__'],
        num_rows: 3048752
    })
})

I now tried to train a ByteLevelBPETokenizer from scratch, but I always get the Error “TypeError: Can’t convert None to pystring”

from transformers import AutoTokenizer
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer
from tqdm import tqdm



raw_dataset = load_dataset('drsis/law_de_courtdecisions')

def get_training_corpus():
    dataset = raw_dataset['train']
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples['content']

## Initialize an empty tokenizer
tokenizer = ByteLevelBPETokenizer(add_prefix_space=True)

# Train the tokenizer on the corpus using the generator get_training_corpus as a source
tokenizer.train_from_iterator(get_training_corpus(), 
            vocab_size=52_000, 
            min_frequency=2, 
            special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
            show_progress=True)

Has somebody an idea what the issue is. I tried to go over the whole content key of the dataset and attempted to find out if there is some None data, but this was not the case.

I would appreciate it if somebody could help me out.

Thanks

drsis · December 26, 2022, 8:04am

Found the error by myself.
There are 171 empty elements in the dataset which caused the TypeError.

Topic		Replies	Views
Train_from_iterator throwing TypeError: expected string or buffer errir 🤗Tokenizers	2	23	January 3, 2025
NLP dataset for ByteLevelTokenizer Training 🤗Datasets	1	2096	February 16, 2021
Having an issue with 'NoneType' after using to_df_dataset() function Beginners	3	3083	January 13, 2024
HuggingFace BPE Trainer Error - Training Tokenizer 🤗Tokenizers	1	2997	July 14, 2022
RuntimeError: Could not infer dtype of NoneType 🤗Transformers	1	2279	June 24, 2024

Tokenizer from scratch Error TypeError: Can't convert None to PyString

Related topics