Hi everyone,
I’m working on a summarization project and am trying to accurately capture coreferences across multiple sentences to improve coherence in the summary outputs. I need a way to group sentences that rely on each other (for instance, when a second sentence needs the first one in order to make sense). Example:
Jay joined the Tonight Show in September. He was on the show for 20 years or so.
So the second sentence ("He was on the show for 20 years or so.") will not make sense on its own in an extractive summary. I want to detect that it strongly depends on the previous sentence and group the two like this:
Jay joined the Tonight Show in September, he was on the show for 20 years or so.
(^^ I have replaced the "." with a "," to join those two sentences before preprocessing, selecting the most important sentences, and summarizing.)
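To make the grouping step concrete, here is a rough sketch of what I plan to do once I have coreference clusters. It is not tied to any particular library: the `group_dependent_sentences` helper and the hand-built offsets/clusters are just illustrative, standing in for whatever coref tool ends up working.

```python
# Rough sketch: join a sentence onto the previous one when a coreference
# cluster has mentions in both, so the pair stays together during extraction.
# Sentence offsets and clusters are hand-made here; in practice they would
# come from the coref tool.

def group_dependent_sentences(sentences, clusters):
    """sentences: list of (start_char, end_char, text)
    clusters: list of clusters, each a list of (start_char, end_char) mention spans"""
    def sent_of(span):
        for i, (s, e, _) in enumerate(sentences):
            if s <= span[0] < e:
                return i
        return None

    # A sentence depends on the previous one if some cluster links the two.
    dependent = set()
    for cluster in clusters:
        idxs = sorted({sent_of(m) for m in cluster if sent_of(m) is not None})
        for a, b in zip(idxs, idxs[1:]):
            if b == a + 1:
                dependent.add(b)

    # Join dependent sentences: swap the previous "." for "," and lowercase the pronoun.
    groups = []
    for i, (_, _, text) in enumerate(sentences):
        if i in dependent and groups:
            groups[-1] = groups[-1].rstrip(".") + ", " + text[0].lower() + text[1:]
        else:
            groups.append(text)
    return groups


text = "Jay joined the Tonight Show in September. He was on the show for 20 years or so."
sent_texts = ["Jay joined the Tonight Show in September.",
              "He was on the show for 20 years or so."]
sentences, pos = [], 0
for s in sent_texts:
    start = text.index(s, pos)
    sentences.append((start, start + len(s), s))
    pos = start + len(s)

clusters = [[(0, 3), (42, 44)]]  # "Jay" ... "He"
print(group_dependent_sentences(sentences, clusters))
# ['Jay joined the Tonight Show in September, he was on the show for 20 years or so.']
```

So the real question is how to get reliable cross-sentence clusters to feed into that step.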
What I’ve Tried So Far:
- Stanford CoreNLP: I used CoreNLP’s coreference system, but it seems to identify coreferences mainly within individual sentences and fails to link entities across sentences. I’ve experimented with various chunk sizes to no avail.
- spaCy with neuralcoref: This had some success with single pronoun references, but it struggled with document-level coherence, especially with more complex coreference chains involving entity aliases or nested references.
- AllenNLP CorefPredictor: I attempted this as well, but the results were inconsistent, and it didn’t capture some key cross-sentence coreferences that were crucial for summary cohesion. (A rough snippet of what I ran is below this list.)
- Hugging Face neuralcoref: the package is so old and unmaintained that even installation fails on Python 3.12+.
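For reference, this is roughly how I invoked the AllenNLP predictor (the model archive URL is from memory, so it may have moved), and the output I was inspecting:

```python
from allennlp.predictors.predictor import Predictor

# Pretrained SpanBERT coref model; archive path is from memory and may need updating.
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz"
)

out = predictor.predict(
    document="Jay joined the Tonight Show in September. He was on the show for 20 years or so."
)
# out["document"] is the token list; out["clusters"] holds lists of [start, end]
# token-index spans, which I then try to map back onto sentence boundaries.
print(out["clusters"])
```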
My Setup:
- Python, mostly using Hugging Face Transformers.
- Willing to work with custom setups if needed, or explore pre-trained models specific to coreference if available on the Hub.
If anyone has experience with a reliable setup for coreference that works well with multi-sentence contexts, or if there’s a fine-tuned model you’d recommend, I’d really appreciate your insights!
Thank you in advance for any guidance or suggestions!