Hi,
I tried to train a tokenizer from scratch with my own created dataset.
The dataset has the following structure:
DatasetDict({
    train: Dataset({
        features: ['slug', 'date', 'type', 'content', 'word_count', '__index_level_0__'],
        num_rows: 3048752
    })
})
I then tried to train a ByteLevelBPETokenizer from scratch, but I always get the error "TypeError: Can't convert None to pystring":
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

raw_dataset = load_dataset('drsis/law_de_courtdecisions')

def get_training_corpus():
    # Yield the 'content' column in batches of 1000 rows
    dataset = raw_dataset['train']
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples['content']

# Initialize an empty tokenizer
tokenizer = ByteLevelBPETokenizer(add_prefix_space=True)

# Train the tokenizer on the corpus, using the generator get_training_corpus as the source
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    show_progress=True,
)
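One workaround I considered (I am not sure it is the right fix) is filtering each yielded batch so that only plain strings reach the trainer, since a None anywhere in a batch would explain the "Can't convert None to pystring" error. A minimal sketch of that filter, on a made-up sample batch:

```python
# Hypothetical batch, shaped like what get_training_corpus() yields;
# the strings here are made up for illustration
batch = ["Urteil vom 1. Januar 2020 ...", None, "Beschluss des Gerichts ..."]

# Keep only plain Python strings and drop None or any other type
clean_batch = [text for text in batch if isinstance(text, str)]
print(clean_batch)
```

In the real generator, the same list comprehension would be applied to `samples['content']` before yielding.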
Does anybody have an idea what the issue is? I went over the whole 'content' column of the dataset to check whether it contains any None values, but it does not.
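For reference, the check I ran looked roughly like this (the sample column below is made up for illustration; the real input was the full 'content' column):

```python
def find_non_string_rows(column):
    """Return the indices of entries that are not plain Python strings."""
    return [i for i, text in enumerate(column) if not isinstance(text, str)]

# Made-up sample column for illustration
sample_column = ["ok", None, "fine"]
print(find_non_string_rows(sample_column))  # [1]
```

On my actual 'content' column this returned an empty list, which is why the error confuses me.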
I would appreciate it if somebody could help me out.
Thanks