Split document into sentences for sentence embedding

LostGoatOnHill · February 9, 2021, 10:25am

Hi,

Wondering if there is a Hugging Face alternative to Gensim's split_sentences method, i.e. something that takes a document and splits it into sentences ready for model.encode()?

https://www.kite.com/python/docs/gensim.summarization.textcleaner.split_sentences

A first timer says many thanks

BramVanroy · February 9, 2021, 12:10pm

So you want to split a text into sentences and then create a sentence embedding for each sentence? Just use a parser like Stanza or spaCy to tokenize and sentence-segment your data. This is typically the first step in many NLP tasks.
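
For illustration, a minimal sketch of that workflow, assuming spaCy v3 and the sentence-transformers package (suggested by the model.encode() call in the question, but not confirmed in the thread). The function names, the regex fallback, and the model name "all-MiniLM-L6-v2" are assumptions chosen for this example, not something either poster specified.

```python
import re


def naive_split(text):
    """Regex fallback (an assumption for this sketch): split after '.', '!' or '?'
    followed by whitespace. It will mis-split abbreviations such as 'e.g. this'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_sentences(text):
    """Split text into sentences with spaCy's rule-based sentencizer.

    Falls back to naive_split if spaCy is not installed.
    """
    try:
        import spacy
    except ImportError:
        return naive_split(text)
    nlp = spacy.blank("en")       # blank English pipeline, no model download needed
    nlp.add_pipe("sentencizer")   # punctuation-based sentence boundary detection
    return [sent.text.strip() for sent in nlp(text).sents]


def embed_sentences(sentences, model_name="all-MiniLM-L6-v2"):
    """Encode each sentence into a vector. Assumes sentence-transformers is
    installed; the model name is only an example and is downloaded on first use."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return model.encode(sentences)  # one embedding row per sentence


if __name__ == "__main__":
    document = "Hello world. How are you? Fine!"
    sentences = split_sentences(document)
    print(sentences)
    # Uncomment to compute embeddings (requires sentence-transformers):
    # print(embed_sentences(sentences).shape)
```

The rule-based sentencizer is fast but only looks at punctuation; a full spaCy model (or Stanza) uses a trained parser and should handle abbreviations and unusual punctuation better, at the cost of a model download.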

LostGoatOnHill · February 9, 2021, 12:26pm

Indeed, just wondered if Huggingface had their own variant. Thanks
