Handling long text in BERT for Question Answering

I’ve read a post that explains how the sliding window works, but I cannot find any information on how it is actually implemented.

From what I understand, if the input is too long, a sliding window can be used to process the text.

Please correct me if I am wrong.
Say I have a text "In June 2017 Kaggle announced that it passed 1 million registered users".

Given some stride and max_len, the input can be split into chunks with overlapping words (not considering padding).

In June 2017 Kaggle announced that # chunk 1
announced that it passed 1 million # chunk 2
1 million registered users # chunk 3
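
I assume the chunking itself is done by the tokenizer along these lines (note that max_length and stride count tokens rather than words, and the small values are just so this short example overflows):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "In June 2017 Kaggle announced that it passed 1 million registered users"

# return_overflowing_tokens splits the text into several overlapping chunks
enc = tokenizer(
    text,
    max_length=10,
    stride=4,                        # number of tokens shared between consecutive chunks
    truncation=True,
    return_overflowing_tokens=True,
)
for ids in enc["input_ids"]:
    print(tokenizer.decode(ids))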

If my questions were "when did Kaggle make the announcement" and "how many registered users", I can use chunk 1 and chunk 3 and not use chunk 2 at all in the model. Not quite sure if I should still use chunk 2 to train the model.

So the input will be:
[CLS]when did Kaggle make the announcement[SEP]In June 2017 Kaggle announced that[SEP]
and
[CLS]how many registered users[SEP]1 million registered users[SEP]
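
I imagine these pairs are built by passing the question and the context to the tokenizer together, with something like truncation="only_second" so that only the context gets chunked (a sketch of what I have in mind):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
question = "when did Kaggle make the announcement"
context = "In June 2017 Kaggle announced that it passed 1 million registered users"

enc = tokenizer(
    question,
    context,
    max_length=16,                   # question + context chunk + special tokens
    stride=4,
    truncation="only_second",        # chunk only the context, never the question
    return_overflowing_tokens=True,
)
for ids in enc["input_ids"]:
    print(tokenizer.decode(ids))     # [CLS] question [SEP] context chunk [SEP]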


Then if I have a question with no answer, do I feed it into the model with all chunks, as below, and indicate the starting and ending index as -1? For example, "can pigs fly?"

[CLS]can pigs fly[SEP]In June 2017 Kaggle announced that[SEP]

[CLS]can pigs fly[SEP]announced that it passed 1 million[SEP]

[CLS]can pigs fly[SEP]1 million registered users[SEP]
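
For the labelling I picture a helper roughly like this (hypothetical code on my side, and I am not sure whether chunks without the answer should get -1 or the index of the [CLS] token):

def label_chunk(offset_mapping, answer_start_char, answer_end_char, no_answer_index=0):
    """Map a character-level answer span to token positions within one chunk.

    offset_mapping holds (char_start, char_end) pairs for the chunk's context
    tokens. If the answer is not fully contained in this chunk, the chunk is
    labelled as "no answer" (here with no_answer_index for both positions).
    """
    start_token = end_token = None
    for i, (char_s, char_e) in enumerate(offset_mapping):
        if char_s <= answer_start_char < char_e:
            start_token = i
        if char_s < answer_end_char <= char_e:
            end_token = i
    if start_token is None or end_token is None:
        return no_answer_index, no_answer_index
    return start_token, end_token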


Hi @benj, Sylvain has a nice tutorial (link) on question answering that provides a lot of detail on how the sliding window approach is implemented.

The short answer is that all chunks are used to train the model, which is why some fairly complex post-processing is required to combine everything back together.
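
Very roughly: each chunk produces its own start/end logits, candidate spans are scored with start_logit + end_logit, mapped back to the original text via the offset mapping, and the best valid span across all chunks wins. Here is a simplified sketch of that idea (not the tutorial's exact code, and the chunk dictionaries are an assumption of mine):

import numpy as np

def best_answer(chunk_predictions, context, n_best=20, max_answer_len=30):
    # chunk_predictions: one dict per chunk with "start_logits", "end_logits"
    # and "offsets" (character offsets into `context`, None for non-context tokens)
    best = {"score": -float("inf"), "text": ""}
    for chunk in chunk_predictions:
        start_candidates = np.argsort(chunk["start_logits"])[-n_best:]
        end_candidates = np.argsort(chunk["end_logits"])[-n_best:]
        for s in start_candidates:
            for e in end_candidates:
                if chunk["offsets"][s] is None or chunk["offsets"][e] is None:
                    continue  # skip question and special tokens
                if s > e or e - s + 1 > max_answer_len:
                    continue  # skip impossible or overly long spans
                score = chunk["start_logits"][s] + chunk["end_logits"][e]
                if score > best["score"]:
                    char_s, char_e = chunk["offsets"][s][0], chunk["offsets"][e][1]
                    best = {"score": float(score), "text": context[char_s:char_e]}
    return best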


Sir, I saw the article. Can you please tell me, after fine-tuning the model, how to use it for question answering?

hey @rohit11, i think the simplest approach would be to load the fine-tuned model and tokenizer in a question-answering pipeline as follows:

from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering

# path to the directory where the fine-tuned model was saved
model_checkpoint = "/path/to/your/model"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
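
you can then ask it a question against a context like this (made-up example):

result = qa(
    question="When did Kaggle make the announcement?",
    context="In June 2017 Kaggle announced that it passed 1 million registered users",
)
print(result)   # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}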

you can consult the docs for more information on the pipeline :slightly_smiling_face:

Hi

When handling a long context by breaking it into chunks, where each chunk becomes a separate example, is it likely that multiple chunks can contain the answer independently, or can the answer lie in an overlapping fashion spanning two chunks?


You would want to find the “best span”. Check Google’s SQuAD implementation here.

@Sam2021 As you can see in the notebook, they find the answer in each chunk with adjusted start and end token positions. That is also why they use offset mappings, which map tokens back to the original text.
Answers do not get split across two chunks; that is why they use doc_stride, which basically mitigates this problem.
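
To make the role of the offsets concrete, here is a small sketch (assuming bert-base-uncased and a fast tokenizer) that prints which part of the original text each chunk covers:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
question = "how many registered users"
context = "In June 2017 Kaggle announced that it passed 1 million registered users"

enc = tokenizer(
    question,
    context,
    max_length=16,
    stride=4,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
for i, offsets in enumerate(enc["offset_mapping"]):
    # keep only the offsets of context tokens (sequence id 1)
    ctx = [off for off, sid in zip(offsets, enc.sequence_ids(i)) if sid == 1]
    start_char, end_char = ctx[0][0], ctx[-1][1]
    print(f"chunk {i} covers context[{start_char}:{end_char}]: {context[start_char:end_char]!r}")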

Hello, I am having an issue with the same situation. I have long paragraphs from which I want to extract the answer. I split the paragraphs into chunks and used the (start, end) pair with the highest probabilities in each chunk (for each paragraph). I saw the notebook, and the solution of taking the n_best (start, end) candidates and checking the score of all the combinations is interesting. But I wonder if there is a more elegant way to decide which (start, end) tokens to choose among the chunks.
Is there an official implementation of this problem?
The best-span function implemented in the official code doesn’t really solve the problem of having the start index in chunk 2 and the end index in chunk 9, for example. It just solves the problem of picking the right sub-paragraph.
Thanks in advance.
