Is it possible to call tokenize() from transformers.BertTokenizer on an Arrow dataset? This feature is similar, but it is about training a tokenizer, not running one. When I run the instantiated tokenizer on a StringScalar Arrow data type instead of a Python str, I get the following error. Is there no way for me to tokenize an Arrow dataset without converting it to a native Python data type in memory with as_py(), which is a very expensive operation?
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-48f23c831035> in <module>
7 results.append([])
8 else:
----> 9 tokens = tokenizer.tokenize(line)
10 tokens = tokenizer.convert_tokens_to_ids(tokens)
11 if tokens:
/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
258 escaped_special_toks = [re.escape(s_tok) for s_tok in self.all_special_tokens]
259 pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
--> 260 text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)
261
262 def split_on_token(tok, text):
/usr/lib64/python3.7/re.py in sub(pattern, repl, string, count, flags)
192 a callable, it's passed the Match object and must return
193 a replacement string to be used."""
--> 194 return _compile(pattern, flags).sub(repl, string, count)
195
196 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object