I’m trying to run the question-answering pipeline on chunks of a long document. DistilBertTokenizerFast has good features to support this: it can break a document into overlapping chunks, and it can return the offset mapping so that you can match labels up with tokens. By default, however, the pipeline uses DistilBertTokenizer, which has neither of these features. Is there a better way than the one below to find out which model the pipeline is using and to construct a matching “fast” tokenizer?
from transformers import pipeline
from transformers import DistilBertTokenizerFast
nlp = pipeline("question-answering")
regular_tokenizer = nlp.tokenizer
# look up which model is being used on github
# https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines.py
# currently the model is:
model_name = "distilbert-base-cased-distilled-squad"
fast_tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
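To make concrete what I mean by overlapping chunks with offsets, here is a rough plain-Python sketch of the behavior I am after. It is only an illustration: it splits on whitespace instead of WordPiece, and `chunk_with_offsets`, `max_tokens`, and `stride` are names I made up for this example, not part of transformers.

```python
import re


def chunk_with_offsets(text, max_tokens=8, stride=3):
    """Split text into overlapping windows of whitespace tokens.

    Returns a list of chunks. Each chunk is a list of
    (token, (start, end)) pairs, where start/end are character
    offsets into the original text. Consecutive chunks share
    `stride` tokens.
    """
    if not 0 <= stride < max_tokens:
        raise ValueError("stride must be in [0, max_tokens)")

    # Hypothetical stand-in for a real tokenizer: one token per
    # run of non-whitespace characters, with its character span.
    tokens = [(m.group(), (m.start(), m.end()))
              for m in re.finditer(r"\S+", text)]

    step = max_tokens - stride
    chunks = []
    for begin in range(0, len(tokens), step):
        chunks.append(tokens[begin:begin + max_tokens])
        # Stop once a window has reached the end of the document.
        if begin + max_tokens >= len(tokens):
            break
    return chunks


document = "The quick brown fox jumps over the lazy dog near the river bank"
for chunk in chunk_with_offsets(document, max_tokens=6, stride=2):
    # Map the chunk back to its character span in the original document.
    start, end = chunk[0][1][0], chunk[-1][1][1]
    print((start, end), document[start:end])
```

With offsets like these, an answer span predicted inside any chunk can be mapped back to character positions in the full document, which is what I want the fast tokenizer to do for me.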