Handling long text in BERT for Question Answering

I’ve read a post which explains how the sliding window works, but I cannot find any information on how it is actually implemented.

From what I understand, if the input is too long, a sliding window can be used to process the text.

Please correct me if I am wrong.
Say I have the text "In June 2017 Kaggle announced that it passed 1 million registered users".

Given some stride and max_len, the input can be split into chunks with overlapping words (not considering padding).

In June 2017 Kaggle announced that # chunk 1
announced that it passed 1 million # chunk 2
1 million registered users # chunk 3
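
Here is a minimal word-level sketch of that windowing, just to make the mechanics concrete (the max_len and stride values are made up, and real implementations slide over subword tokens rather than words):

text = "In June 2017 Kaggle announced that it passed 1 million registered users"
words = text.split()

max_len = 6  # maximum words per chunk (made-up value)
stride = 4   # how far the window moves each step, so consecutive chunks overlap
             # (note: in the Hugging Face tokenizers API, `stride` is instead the size of the overlap)

chunks = []
for start in range(0, len(words), stride):
    chunks.append(" ".join(words[start:start + max_len]))
    if start + max_len >= len(words):
        break

for chunk in chunks:
    print(chunk)  # prints the three chunks above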

If my questions were "when did Kaggle make the announcement" and "how many registered users", I could use chunk 1 and chunk 3 and not use chunk 2 at all in the model. I'm not quite sure if I should still use chunk 2 to train the model.

So the input will be:
[CLS]when did Kaggle make the announcement[SEP]In June 2017 Kaggle announced that[SEP]
and
[CLS]how many registered users[SEP]1 million registered users[SEP]
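
As far as I know, this pairing of the question with overlapping context chunks is what the Hugging Face fast tokenizers can do for you when you pass return_overflowing_tokens; a rough sketch (the checkpoint name, max_length and stride are just placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "when did Kaggle make the announcement"
context = "In June 2017 Kaggle announced that it passed 1 million registered users"

# truncation="only_second" truncates only the context, and stride is the number
# of tokens shared between consecutive chunks
encoded = tokenizer(
    question,
    context,
    max_length=20,             # placeholder value
    truncation="only_second",
    stride=5,                  # placeholder value
    return_overflowing_tokens=True,
)

for ids in encoded["input_ids"]:
    print(tokenizer.decode(ids))  # each line looks like [CLS] question [SEP] chunk [SEP]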


Then if I have a question with no answer, do I feed it into the model with all of the chunks, like below, and indicate the starting and ending index as -1? For example, "can pigs fly?"

[CLS]can pigs fly[SEP]In June 2017 Kaggle announced that[SEP]

[CLS]can pigs fly[SEP]announced that it passed 1 million[SEP]

[CLS]can pigs fly[SEP]1 million registered users[SEP]

Hi @benj, Sylvain has a nice tutorial (link) on question answering that provides a lot of detail on how the sliding window approach is implemented.

The short answer is that all chunks are used to train the model, which is why some fairly involved post-processing is required to combine everything back together.
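
If I remember the tutorial correctly, chunks that don't contain the answer are not labelled with -1; instead both the start and end labels point at the [CLS] token. A rough, simplified sketch of that logic (the function name and arguments are made up for illustration):

def label_chunk(offsets, answer_start_char, answer_end_char, cls_index=0):
    """offsets: (start_char, end_char) of each context token in this chunk."""
    token_start = token_end = None
    for i, (start, end) in enumerate(offsets):
        if start <= answer_start_char < end:
            token_start = i
        if start < answer_end_char <= end:
            token_end = i
    if token_start is None or token_end is None:
        # the answer is not (fully) inside this chunk, so point both labels at [CLS]
        return cls_index, cls_index
    return token_start, token_end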

Sir, I saw the article. Can you please tell me how to use the model for question answering after fine-tuning it?

Hey @rohit11, I think the simplest approach would be to load the fine-tuned model and tokenizer in a question-answering pipeline as follows:

from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering

# load the fine-tuned checkpoint from disk
model_checkpoint = "/path/to/your/model"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# wrap the model and tokenizer in a question-answering pipeline
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)
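
You can then query it with a question and a context, e.g. (the texts here are just for illustration):

result = qa_pipeline(
    question="when did Kaggle make the announcement",
    context="In June 2017 Kaggle announced that it passed 1 million registered users",
)
print(result)  # a dict with keys like 'answer', 'score', 'start', 'end'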

you can consult the docs for more information on the pipeline :slightly_smiling_face: