Which Hugging Face summarization model supports more than 1024 tokens? Which model is more suitable for programming-related articles?

If this is not the best place to ask this question, please point me to the right one.

I am planning to use Hugging Face summarization models (Models - Hugging Face) to summarize the transcriptions of my lecture videos.

So far I have tested facebook/bart-large-cnn and sshleifer/distilbart-cnn-12-6, but they support a maximum input of 1024 tokens.
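For reference, this is roughly how I am calling them (a minimal sketch; the transcript filename is just a placeholder). Anything past the 1024-token limit is simply cut off:

```python
from transformers import pipeline

# facebook/bart-large-cnn accepts at most 1024 input tokens
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# placeholder path for one of my lecture transcripts
with open("lecture_transcript.txt") as f:
    transcript = f.read()

# truncation=True silently drops everything after token 1024,
# so most of a long lecture never reaches the model
summary = summarizer(transcript, max_length=150, min_length=40, truncation=True)
print(summary[0]["summary_text"])
```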

So here are my questions:

1: Are there any summarization models that support longer inputs, e.g., a 10,000-word article?

2: What are the optimal output lengths for given input lengths? Let's say for a 1,000-word input, what should the optimal minimum output length (the min length of the summarized text) be?

3: Which model would be likely to work well on programming-related articles?

Please give me model names from this repository: Models - Hugging Face


Hi, not sure if you still need an answer to your question, but here are some options you can try (a minimal LED sketch follows the list):

  1. LED (16k token input length) - allenai/led-base-16384 · Hugging Face
  2. PRIMERA (~4k token input length) - allenai/PRIMERA · Hugging Face
  3. Unlimiformer (unlimited input length?) - abertsch/unlimiformer-bart-govreport-alternating · Hugging Face (read the description first!)
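Here is a minimal sketch of running LED on a long transcript. The transcript filename and generation lengths are placeholders, and note that allenai/led-base-16384 is a base checkpoint that has not been fine-tuned for summarization, so for real use you will likely want a fine-tuned LED variant:

```python
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

# base checkpoint with a 16k-token input window (not fine-tuned for summarization)
checkpoint = "allenai/led-base-16384"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LEDForConditionalGeneration.from_pretrained(checkpoint)

# placeholder path for a long lecture transcript
long_text = open("lecture_transcript.txt").read()
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=16384)

# LED expects global attention on at least the first token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=256,   # placeholder output lengths; tune for your inputs
    min_length=64,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```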