Bart input confusion

In BART, for the summarization task, the input length is 1024 (1024 tokens). What does this input represent? For example, if I have a document with sentences s1, s2, …, s500, does this mean we feed the document sentence by sentence, or does the whole document have to go in as a single 1024-token input (all sentences must fit, with truncation)? If it's the latter, doesn't that cause information loss?
And if it's sequence by sequence, say 20 sentences at a time out of the 500, will the output at the top of the encoder change each time?
To be honest I’m having a difficult time imagining how the encoder is processing the document.

Hi @Hildweig

For BART, the maximum input length is 1024 tokens. For simplicity you can think of a token as a word (although a single word can also be split into multiple tokens). It's not sentences.
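
You can see this directly with the tokenizer. A quick sketch, assuming the `facebook/bart-large-cnn` checkpoint just as an example (any BART tokenizer behaves the same way):

```python
from transformers import BartTokenizer

# Example checkpoint; used here only to illustrate word vs. token
tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

# Longer or rarer words are typically split into several sub-word tokens,
# so the number of tokens is usually larger than the number of words.
print(tok.tokenize("Summarization with transformers is fun"))
```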

Here, a document means a sequence of at most 1024 tokens. Processing sequences longer than that is still a topic of ongoing research.
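
To make the limit concrete, here is a minimal sketch (again assuming `facebook/bart-large-cnn`; the toy document is just repeated filler text) that compares the tokenized length of a long document against the model's position limit:

```python
from transformers import BartTokenizer, BartConfig

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
config = BartConfig.from_pretrained("facebook/bart-large-cnn")

# Stand-in for a real 500-sentence document
document = " ".join(["This is one sentence of the document."] * 500)

ids = tokenizer(document)["input_ids"]
print(len(ids))                         # far more than 1024 for a long document
print(config.max_position_embeddings)   # 1024 -- the encoder's hard limit
```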

And I would recommend reading the original Transformer paper (Attention Is All You Need) to get an idea of how a sequence is processed by the encoder, or The Illustrated Transformer blog post.

So does it get truncated? If yes, is there a special method they use for truncation?