Identify grammatical correctness of text

Hello.
I am trying to analyze text data. The data is composed of long documents broken into segments.
The segments are a mix of sentences, word salads, titles, headers, and footnotes. Even most of the sentences are broken into smaller segments.
I wonder if there is an unsupervised/zero-shot way to classify the segments as well-defined sentences (i.e., segments with the grammatical structure of a complete sentence) or not.

I tried an NLI model with the zero-shot classification pipeline (the example below is in English; I actually tried it in my native language and it didn't work there either):

from transformers import pipeline

pipe = pipeline(model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7")
scores = pipe(
    ["My name is John.", "This is a", "paramount liv tyler",
     "COST OF LIVING", "It is raining.", "Don't use chatGPT."],
    candidate_labels=["well defined sentence", "incomplete sentence"],
)
scores

Output:

[{'sequence': 'My name is John.',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.9764502644538879, 0.023549752309918404]},
 {'sequence': 'This is a',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.8909828066825867, 0.10901718586683273]},
 {'sequence': 'paramount liv tyler',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.9137822985649109, 0.0862177163362503]},
 {'sequence': 'COST OF LIVING',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.8700514435768127, 0.12994852662086487]},
 {'sequence': 'It is raining.',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.7896050214767456, 0.21039502322673798]},
 {'sequence': "Don't use chatGPT.",
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.9611700773239136, 0.03882993757724762]}]

It went terribly: the pipeline labels everything, even proper sentences, as incomplete.
Since the model understands natural language, shouldn't it understand what a well-defined sentence is? Or does it always need to be trained/fine-tuned for this?
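One variation I have not fully explored is passing a custom `hypothesis_template` and more natural label phrases. The pipeline's default template is "This example is {}.", which reads oddly once a label like "well defined sentence" is substituted; full noun phrases might entail better. This is just a sketch of the same pipeline with different arguments, not something I have verified to work:

```python
from transformers import pipeline

pipe = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7",
)

# Full phrases as labels, plus a hypothesis template that reads as a
# natural sentence once a label is substituted for {}.
result = pipe(
    "My name is John.",
    candidate_labels=[
        "a complete grammatical sentence",
        "an incomplete sentence fragment",
    ],
    hypothesis_template="This text is {}.",
)
print(result["labels"][0], result["scores"][0])
```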

Do you have suggestions? Other approaches?
Which transformers methods should I use?
Would other labels (in this zero-shot classification example) be more accurate?
Do I necessarily need a dataset to train/fine-tune on?
Does anyone have suggestions for datasets of well-defined sentences (in Brazilian Portuguese)?
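For context, I can get partway there with crude surface heuristics (capitalized start, sentence-final punctuation, minimum token count, not ALL CAPS), but these obviously miss the actual grammatical structure I care about, which is why I'm looking for a model-based approach. The thresholds below are arbitrary:

```python
import re

def looks_like_sentence(text: str) -> bool:
    """Crude surface heuristic: capitalized start, sentence-final
    punctuation, at least three tokens, and not ALL CAPS (likely a header)."""
    text = text.strip()
    if len(text.split()) < 3:      # too short to be a full sentence
        return False
    if text.isupper():             # headers like "COST OF LIVING"
        return False
    if not text[0].isupper():      # fragments like "paramount liv tyler"
        return False
    return bool(re.search(r"[.!?]$", text))

for seg in ["My name is John.", "This is a", "paramount liv tyler",
            "COST OF LIVING", "It is raining.", "Don't use chatGPT."]:
    print(f"{seg!r}: {looks_like_sentence(seg)}")
```

This correctly separates my toy examples, but it would misclassify any grammatical fragment that happens to be capitalized and punctuated.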

Thank you