I am trying to tokenize some amino acid sequences using a BPE tokenizer, but it hasn’t finished training after several hours.
There are ~500,000 sequences in a .txt file formatted like so:
MKTLLLTLVVVTIVCLDLGYTLKCHNTQLPFIYNTCPEGKNLCFKATLKFPLKFPVKRGCAATCPRSSSLVKVVCCKTDKCN
MALFRKKDKYIRINPNRSRIESAPQAKPEVPDELFSKCPACKVILYKNDLGLEKTCQHCSYNFRITAQERRALTVDEGSFEELFTGIET
ADNRRPIWNLGHMVNALKQIPTFLXDGANA
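Each line of the file is one raw sequence, so a quick line count is how I checked the size (the file name here is just a placeholder for whatever the full file is called):

# rough sanity check on the full input file (placeholder name)
with open('all_seqs.txt') as f:
    print(sum(1 for line in f if line.strip()))  # ~500,000 sequences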
I have included the code I used below:
## Import Data
import pandas as pd
import numpy as np
# fetch all reviewed UniProt sequences as a single tab-separated column
url = 'https://www.uniprot.org/uniprot/?query=reviewed:yes&format=tab&columns=sequence'
seqs = pd.read_csv(url, sep='\t')
seqs_arr = seqs.to_numpy()
#take 1% sample of larger dataset for testing
n_rows = seqs_arr.shape[0]
rand_ind = np.random.choice(n_rows, size=5000, replace=False)
seqs_arr_small = seqs_arr[rand_ind, :]
np.savetxt('seqs_arr_small.txt', seqs_arr_small, fmt='%s')
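To be sure the sampled file really is one sequence per line before training, I read it back as a spot check:

# spot check: the sampled file should contain 5000 sequences, one per line
with open('seqs_arr_small.txt') as f:
    lines = [l.strip() for l in f if l.strip()]
print(len(lines))     # expect 5000
print(lines[0][:60])  # peek at the start of the first sequence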
## Train tokenizer
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
paths = 'seqs_arr_small.txt'
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files=paths, vocab_size=52000, min_frequency=3, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
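Once training finishes, I plan to save the vocab/merges and spot-check an encoding along these lines (the output directory name is just an example, and the exact save method may differ between tokenizers versions):

# save vocab.json / merges.txt and try encoding one sequence
import os
os.makedirs('aa_tokenizer', exist_ok=True)
tokenizer.save_model('aa_tokenizer')
enc = tokenizer.encode('MKTLLLTLVVVTIVCLDLGYT')
print(enc.tokens)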