Is it possible to call tokenize() from transformers.BertTokenizer on an Arrow dataset? This feature is similar, but it is about training a tokenizer, not running one. When I run the instantiated tokenizer on a StringScalar Arrow data type instead of a Python str, I get the following error. Is there no way for me to tokenize an Arrow dataset without converting it to a native Python data type in memory with as_py(), which is a very expensive operation?
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-48f23c831035> in <module>
7 results.append([])
8 else:
----> 9 tokens = tokenizer.tokenize(line)
10 tokens = tokenizer.convert_tokens_to_ids(tokens)
11 if tokens:
/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
258 escaped_special_toks = [re.escape(s_tok) for s_tok in self.all_special_tokens]
259 pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
--> 260 text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)
261
262 def split_on_token(tok, text):
/usr/lib64/python3.7/re.py in sub(pattern, repl, string, count, flags)
192 a callable, it's passed the Match object and must return
193 a replacement string to be used."""
--> 194 return _compile(pattern, flags).sub(repl, string, count)
195
196 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object