Is detokenize available in transformer lib?

innat · March 25, 2023, 6:24am

I’ve searched the docs but couldn’t find any hint.

Generally, detokenize is the inverse of the tokenize method, and can be used to reconstruct a string from a set of tokens.

from transformers import TFBertTokenizer

tf_tokenizer = TFBertTokenizer.from_pretrained("bert-base-uncased")

# something like:
tf_tokenizer.encode([string])  # output: token ids
tf_tokenizer.decode([1, 2, 3])  # output: string

It is available to some extent in other libraries:

github.com

keras-team/keras-nlp/blob/b35e83fbc1/keras_nlp/models/bert/bert_tokenizer_test.py

# Copyright 2023 The KerasNLP Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for BERT tokenizer."""

import os

import pytest
import tensorflow as tf
from absl.testing import parameterized


github.com

tensorflow/text/blob/dc5983e1be3140130f4eb70e981c7cb0fecadd85/tensorflow_text/python/ops/tokenization.py#L179


      
  ...   def detokenize(self, input):
  ...     return tf.strings.reduce_join(input, axis=-1, separator=" ")
  >>> text = tf.ragged.constant([["hello", "world"], ["a", "b", "c"]])
  >>> print(SimpleDetokenizer().detokenize(text))
  tf.Tensor([b'hello world' b'a b c'], shape=(2,), dtype=string)
  """

  __metaclass__ = abc.ABCMeta

  @abc.abstractmethod
  def detokenize(self, input):  # pylint: disable=redefined-builtin
    """Assembles the tokens in the input tensor into a string.

    Generally, `detokenize` is the inverse of the `tokenize` method, and can
    be used to reconstrct a string from a set of tokens.  This is especially
    helpful in cases where the tokens are integer ids, such as indexes into a
    vocabulary table -- in that case, the tokenized encoding is not very
    human-readable (since it's just a list of integers), so the `detokenize`
    method can be used to turn it back into something that's more readable.

    Args:
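
To make the tokenize/detokenize round trip described in that docstring concrete, here is a minimal pure-Python sketch. It is not the TensorFlow Text or KerasNLP implementation: the vocabulary, the ids, and the function names `tokenize` and `detokenize` are made up for illustration, and it assumes plain whitespace splitting.

```python
# Toy vocabulary. The tokens and their ids are hypothetical and do not
# come from any real model.
VOCAB = ["[UNK]", "hello", "world", "a", "b", "c"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def tokenize(text):
    """Split on whitespace and map each token to its id ([UNK] if unknown)."""
    unk_id = TOKEN_TO_ID["[UNK]"]
    return [TOKEN_TO_ID.get(token, unk_id) for token in text.split()]


def detokenize(ids):
    """Map ids back to tokens and join them with single spaces."""
    return " ".join(VOCAB[i] for i in ids)


ids = tokenize("hello world")
print(ids)
print(detokenize(ids))
```

Note that the round trip is only lossless for text made of in-vocabulary tokens separated by single spaces. Unknown words come back as `[UNK]`, which is one reason detokenization is only "generally" the inverse of tokenization.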

sgugger · March 27, 2023, 12:40pm

cc @Rocketknight1

Rocketknight1 · April 24, 2023, 1:09pm

Hi @innat, and sorry for the delay! I don’t think our TF in-graph tokenizers support decoding/detokenization. However, our main tokenizers do, so you could do something like:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer.decode([1, 2, 3])

This should work for most purposes. Do you have a use case for doing detokenization inside a TF graph? We’re very interested if so, because we assumed people would generally not need that!
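
For readers who already have token strings rather than ids, the core of BERT-style (WordPiece) detokenization can be sketched in a few lines of plain Python. This is only an illustration, not how `tokenizer.decode` is implemented: the helper name `wordpiece_detokenize` is hypothetical, and the sketch skips things a real tokenizer also handles, such as removing special tokens and cleaning up spaces around punctuation.

```python
def wordpiece_detokenize(tokens):
    """Join WordPiece tokens into a string.

    Tokens starting with '##' are continuation pieces, so they are glued
    onto the previous token instead of starting a new word.
    """
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Strip the '##' marker and append to the word in progress.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


print(wordpiece_detokenize(["token", "##ization", "is", "fun"]))
```

In practice the built-in `decode` method is the better choice, since it knows the model's actual vocabulary and special tokens.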
