Convert a Python Tokenizer into a TokenizerFast

varun · May 20, 2022, 5:34am

Hi!

I’m trying to fine-tune Transformer-XL on SQuAD for question answering, following the tutorial in the Hugging Face course here.

Unfortunately, Transformer-XL does not have a TokenizerFast, and the tutorial relies on return_offsets_mapping to preprocess the fine-tuning data, an option that is only available in fast tokenizers. The specific preprocessing code can be seen in the course (linked above) and at the end of this post.

Is there any way that I can take the existing TransformerXL Tokenizer and convert it to a Fast Tokenizer?

Simply loading via AutoTokenizer.from_pretrained("transfo-xl-wt103", use_fast=True) does not load a TokenizerFast (tokenizer.is_fast is False).
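
For illustration, this is roughly the kind of mapping I need from the tokenizer: a (start, end) character span for every token. The sketch below is only a toy under a big assumption, namely that tokens are plain whitespace-separated words; that is not how the Transformer-XL tokenizer actually splits text, and the helper name whitespace_offsets is made up for this post.

```python
def whitespace_offsets(text):
    """Return (token, (start, end)) pairs for whitespace-separated tokens.

    Toy stand-in for an offsets mapping: `end` is exclusive, so
    text[start:end] == token. Assumes plain whitespace tokenization,
    which is NOT what the real Transformer-XL tokenizer does.
    """
    pairs = []
    pos = 0  # search position, so repeated words map to the right occurrence
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pairs.append((token, (start, end)))
        pos = end
    return pairs


pairs = whitespace_offsets("Hello  world again")
for token, (start, end) in pairs:
    print(token, start, end)
```

A fast tokenizer returns spans like these for its real tokens, which is what the preprocessing code uses to turn an answer's character positions into token positions.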

Thanks so much!

def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
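
To make the labeling step at the end of that function concrete, here is the same start/end search pulled out into a standalone helper and run on hand-written offsets. The helper name find_token_span and the toy offsets are made up for illustration and are not part of the course code; special tokens are given (0, 0) offsets here purely as a placeholder.

```python
def find_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token) inside the context.

    `offsets` holds one (start, end) character pair per token, and
    context_start/context_end are the first and last context token indices.
    Returns (0, 0) when the answer is not fully inside this context window.
    """
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return (0, 0)

    # Walk forward past every token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward past every token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1

    return (start_token, end_token)


# Toy features: [special] question(2 tokens) [special] context(3 tokens) [special]
# The context is "Hello world again"; the answer "world" covers characters 6..11.
toy_offsets = [(0, 0), (0, 3), (4, 8), (0, 0), (0, 5), (6, 11), (12, 17), (0, 0)]

print(find_token_span(toy_offsets, 4, 6, start_char=6, end_char=11))   # (5, 5)
print(find_token_span(toy_offsets, 4, 6, start_char=20, end_char=25))  # (0, 0)
```

This is the only part of the preprocessing that depends on the offsets, which is why I need some way to get them for Transformer-XL.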
