Sliding window for Long Documents

shreyans92dhankhar · June 20, 2022, 7:51am

Hi,

Is there a way to chunk a long document so that each chunk carries both left and right context? The tokenizer's default parameters provide only left context. Is there a way to also provide right context, and to predict only for the central part of each chunk rather than for the context?

This would be similar to the approach mentioned in this paper: https://arxiv.org/pdf/2011.06993.pdf

melhoushi · February 9, 2023, 10:34pm

return_overflowing_tokens might help:

github.com

huggingface/notebooks/blob/main/examples/question_answering.ipynb
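
For illustration, here is a minimal sketch of the chunking the question describes, written in plain Python so it does not depend on any tokenizer. The names `Window`, `chunk_with_context`, and `merge_core_predictions` are hypothetical helpers, not part of 🤗 Transformers, and the sketch assumes the document is already a flat list of token ids with one prediction per token. With the tokenizer itself, passing `return_overflowing_tokens=True` together with `stride` yields overlapping chunks, which can be post-processed along the same lines.

```python
from typing import List, NamedTuple, Sequence


class Window(NamedTuple):
    tokens: List[int]  # left context + core + right context
    core_start: int    # index in `tokens` where the core begins
    core_end: int      # index in `tokens` where the core ends (exclusive)


def chunk_with_context(
    token_ids: Sequence[int], core_size: int, context_size: int
) -> List[Window]:
    """Split token_ids into windows whose cores tile the document exactly once.

    Each window holds up to `context_size` extra tokens on each side of its
    core. Hypothetical helper: it ignores special tokens, so the caller must
    make sure core_size + 2 * context_size fits the model's maximum length.
    """
    if core_size <= 0 or context_size < 0:
        raise ValueError("core_size must be > 0 and context_size >= 0")
    n = len(token_ids)
    windows = []
    for start in range(0, n, core_size):
        end = min(start + core_size, n)
        left = max(0, start - context_size)   # clip at document start
        right = min(n, end + context_size)    # clip at document end
        windows.append(
            Window(list(token_ids[left:right]), start - left, end - left)
        )
    return windows


def merge_core_predictions(
    windows: Sequence[Window], per_window_preds: Sequence[Sequence]
) -> list:
    """Keep only the predictions made for each window's core, in order."""
    merged = []
    for window, preds in zip(windows, per_window_preds):
        merged.extend(preds[window.core_start:window.core_end])
    return merged
```

Each window is what would be fed to the model; only the slice between `core_start` and `core_end` of its output is kept, so every token is predicted exactly once with context on both sides wherever the document allows it.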

