HuggingFace BPE Trainer Error - Training Tokenizer

mgiardinelli · October 8, 2021, 9:07pm

I am trying to train a ByteLevelBPETokenizer using an iterable instead of from files. There must be something I am doing wrong when I instantiate the trainer, but I can’t tell what it is. When I try to train the tokenizer with my dataset (clothing data from Kaggle) + the BpeTrainer, I get an error.

Also posted on Stack Overflow under the same title: HuggingFace BPE Trainer Error - Training Tokenizer

**TypeError**: 'tokenizers.trainers.BpeTrainer' object cannot be interpreted as an integer

I am using Colab

Step 1: Install tokenizers & download the Kaggle data

!pip install tokenizers

# Download clothing data from Kaggle
# https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/version/1?select=Womens+Clothing+E-Commerce+Reviews.csv

Step 2: Upload the file

# use colab file upload
from google.colab import files
uploaded = files.upload()

Step 3: Clean the data (remove floats) & run trainer

import io
import pandas as pd  

# convert the csv to a dataframe so it can be parsed
data = io.BytesIO(uploaded['clothing_dataset.csv']) 
df = pd.read_csv(data)

# convert the review text to a list so it can be passed as iterable to tokenizer
clothing_data = df['Review Text'].to_list()

# Remove float values from the data (missing reviews are read as NaN, which is a float)
clean_data = []
for item in clothing_data:
  if not isinstance(item, float):
    clean_data.append(item)


from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from tokenizers import trainers, pre_tokenizers
from tokenizers.trainers import BpeTrainer
from pathlib import Path


# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer(lowercase=True)

# Instantiate BpeTrainer
trainer = BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Train the tokenizer
tokenizer.train_from_iterator(clean_data, trainer)

Error: the full traceback is below. I can see that the trainer is of type BpeTrainer.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-103-7738a7becb0e> in <module>()
     34 
     35 # Train the tokenizer
---> 36 tokenizer.train_from_iterator(clean_data, trainer)

/usr/local/lib/python3.7/dist-packages/tokenizers/implementations/byte_level_bpe.py in train_from_iterator(self, iterator, vocab_size, min_frequency, show_progress, special_tokens)
    119             show_progress=show_progress,
    120             special_tokens=special_tokens,
--> 121             initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    122         )
    123         self._tokenizer.train_from_iterator(iterator, trainer=trainer)

TypeError: 'tokenizers.trainers.BpeTrainer' object cannot be interpreted as an integer

Interesting note: if I pass the trainer as a keyword argument (trainer=trainer), I get this instead:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-104-64737f948e6d> in <module>()
     34 
     35 # Train the tokenizer
---> 36 tokenizer.train_from_iterator(clean_data, trainer=trainer)

TypeError: train_from_iterator() got an unexpected keyword argument 'trainer'

drsis · July 14, 2022, 6:35pm

A little late, but I think you have to call train_from_iterator like this:

tokenizer.train_from_iterator(
    clean_data,
    vocab_size=20000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
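
As a rough illustration of why the original call fails: the traceback shows the signature train_from_iterator(self, iterator, vocab_size, min_frequency, show_progress, special_tokens), so a trainer passed as the second positional argument is bound to vocab_size. The sketch below reproduces that binding with stand-ins only. FakeTrainer and this train_from_iterator function are hypothetical, the default values are placeholders, and operator.index() is used here only to imitate an integer conversion; none of this is the library's actual code.

```python
import operator


class FakeTrainer:
    """Hypothetical stand-in for tokenizers.trainers.BpeTrainer."""


def train_from_iterator(iterator, vocab_size=30000, min_frequency=2,
                        show_progress=True, special_tokens=()):
    # Hypothetical stand-in that copies the parameter order from the traceback.
    # operator.index() imitates a conversion that requires a real integer.
    return operator.index(vocab_size)


message = None
try:
    # The trainer lands in the vocab_size slot because it is passed positionally.
    train_from_iterator(["a review"], FakeTrainer())
except TypeError as err:
    message = str(err)
    print(message)

# Passing the options by keyword binds them to the intended parameters.
size = train_from_iterator(["a review"], vocab_size=20000, min_frequency=2)
print(size)
```

The same reasoning explains the second error in the question: the wrapper's signature has no trainer parameter, so trainer=trainer is rejected as an unexpected keyword argument.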

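You can see in the source code (tokenizers/byte_level_bpe.py on the main branch of the huggingface/tokenizers GitHub repository) how the method of ByteLevelBPETokenizer has to be called: it takes the training options directly and builds its own BpeTrainer internally, as the traceback above also shows.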
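
If you would rather keep an explicit BpeTrainer object, one possible alternative is to assemble the tokenizer from the generic Tokenizer class, whose train_from_iterator does accept a trainer argument (the wrapper calls it that way on line 123 of the traceback). This is only a minimal sketch: sample_reviews is a made-up placeholder for clean_data, vocab_size is shrunk to 300 so it runs quickly, and the normalizer, pre-tokenizer, and decoder choices are my assumption about how to roughly mirror ByteLevelBPETokenizer(lowercase=True), not a guaranteed equivalent.

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# Tiny placeholder corpus; substitute the cleaned review list (clean_data).
sample_reviews = [
    "Love this dress, it fits perfectly.",
    "The fabric is soft and the dress is comfortable.",
    "This dress runs small, so order a size up.",
]

# Assemble a byte-level BPE tokenizer from the generic building blocks.
tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=300,  # small placeholder; the question used 20000
    min_frequency=2,
    show_progress=False,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# The generic Tokenizer accepts the trainer as a keyword argument.
tokenizer.train_from_iterator(sample_reviews, trainer=trainer)

encoding = tokenizer.encode("This dress is soft.")
print(encoding.tokens)
```

With the real data, pass clean_data in place of sample_reviews and restore vocab_size=20000.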
