Tokenizer taking extremely long time to train

I am trying to tokenize some amino acid sequences using a BPE tokenizer, but it hasn’t finished training after several hours.

There are ~500,000 sequences in a .txt file formatted like so:

I have included the code I used below.

## Import data

import pandas as pd
import numpy as np

url = ''

seqs = pd.read_csv(url)

seqs_arr = seqs.to_numpy()

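# Take a 1% sample (5,000 of ~500,000 rows) of the larger dataset for testing
n_rows = seqs_arr.shape[0]
rand_ind = np.random.choice(n_rows, size=5000, replace=False)

# Keep the sampled rows themselves (not their .shape) so they can be saved
seqs_arr_small = seqs_arr[rand_ind, :]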

np.savetxt('seqs_arr_small.txt', seqs_arr_small, fmt='%s')
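
Before training, it may help to check what actually ends up in the sampled file. The following is only a minimal sketch under stated assumptions: the `seqs` list and the file name `seqs_sample_demo.txt` are hypothetical stand-ins, since the real data comes from the CSV above. It draws a reproducible sample, writes one sequence per line, then reads the file back and reports the line count and the longest line.

```python
import numpy as np

# Hypothetical stand-in data; the real sequences come from the CSV above.
seqs = ["MKTAYIAKQR", "GAVLIMFWP", "STCYNQ", "DEKRH", "MKV"]

# Fixed seed so the same sample is drawn on every run.
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(seqs), size=3, replace=False)
sample = [seqs[i] for i in sample_idx]

# Write one sequence per line, with no brackets or quoting.
with open("seqs_sample_demo.txt", "w") as f:
    f.write("\n".join(sample) + "\n")

# Read the file back to inspect what the tokenizer would be given.
with open("seqs_sample_demo.txt") as f:
    lines = [line.rstrip("\n") for line in f]

n_lines = len(lines)
max_len = max(len(line) for line in lines)
print(n_lines, max_len)
```

The longest line is worth looking at because each amino acid sequence is an unbroken run of letters with no whitespace, so the pre-tokenizer will likely treat a whole sequence as a single very long "word". That may be relevant to how long training takes, though this sketch does not measure it.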

## Train tokenizer

# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = 'seqs_arr_small.txt'

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52000, min_frequency=3, special_tokens=[