Hello.
I am trying to analyze text data. The data consists of long documents broken into segments.
The segments are a mix of sentences, word salads, titles, headers, and footnotes. Even most sentences are broken into smaller segments.
I wonder if there is an unsupervised/zero-shot way to classify the segments as well-defined sentences or not (that is, whether they have the grammatical structure of a well-defined sentence).
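For comparison, here is a naive rule-based baseline, as a minimal sketch. The function name `looks_like_sentence`, the `min_tokens` threshold, and the rules themselves are my own assumptions for illustration. It only looks at surface cues (length, capitalization, final punctuation) and does not check grammar at all, so it cannot replace a real model.

```python
# Naive surface-level baseline (an assumption for illustration, not a grammar check).
TERMINAL_PUNCTUATION = (".", "!", "?")


def looks_like_sentence(segment: str, min_tokens: int = 3) -> bool:
    """Return True if the segment has the surface shape of a sentence."""
    text = segment.strip()
    if not text:
        return False
    # Very short segments are treated as fragments.
    if len(text.split()) < min_tokens:
        return False
    # All-caps segments are treated as titles or headers.
    if text.isupper():
        return False
    # Require an initial capital letter and terminal punctuation.
    if not text[0].isupper():
        return False
    return text.endswith(TERMINAL_PUNCTUATION)


segments = [
    "My name is John.",
    "This is a",
    "paramount liv tyler",
    "COST OF LIVING",
    "It is raining.",
    "Don't use chatGPT.",
]
flags = [looks_like_sentence(s) for s in segments]
for segment, flag in zip(segments, flags):
    print(f"{segment!r}: {flag}")
```

Rules like these will misfire on segments that start with a quote mark or a digit, and on a sentence that was cut in the middle but happens to end with a period, which is why I am looking for a model-based approach.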
I tried an NLI model with the zero-shot classification pipeline (the example below is in English; I actually tried it in my native language, and it did not work well there either):
from transformers import pipeline
# Zero-shot classification with a multilingual NLI model
pipe = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7")
scores = pipe(
    ["My name is John.", "This is a", "paramount liv tyler", "COST OF LIVING", "It is raining.", "Don't use chatGPT."],
    candidate_labels=["well defined sentence", "incomplete sentence"],
)
scores
Output:
[{'sequence': 'My name is John.',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.9764502644538879, 0.023549752309918404]},
 {'sequence': 'This is a',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.8909828066825867, 0.10901718586683273]},
 {'sequence': 'paramount liv tyler',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.9137822985649109, 0.0862177163362503]},
 {'sequence': 'COST OF LIVING',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.8700514435768127, 0.12994852662086487]},
 {'sequence': 'It is raining.',
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.7896050214767456, 0.21039502322673798]},
 {'sequence': "Don't use chatGPT.",
  'labels': ['incomplete sentence', 'well defined sentence'],
  'scores': [0.9611700773239136, 0.03882993757724762]}]
It went terribly: every segment is labeled as an incomplete sentence, including the complete ones.
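To make that concrete, here is a small sketch that picks the top label for each segment and counts them. The scores are copied from the output above and rounded to four decimals; the helper name `top_label` is just something I made up for this post.

```python
from collections import Counter

LABELS = ["incomplete sentence", "well defined sentence"]

# (sequence, scores) pairs copied from the pipeline output, rounded to 4 decimals.
raw = [
    ("My name is John.", [0.9765, 0.0235]),
    ("This is a", [0.8910, 0.1090]),
    ("paramount liv tyler", [0.9138, 0.0862]),
    ("COST OF LIVING", [0.8701, 0.1299]),
    ("It is raining.", [0.7896, 0.2104]),
    ("Don't use chatGPT.", [0.9612, 0.0388]),
]
results = [{"sequence": s, "labels": LABELS, "scores": sc} for s, sc in raw]


def top_label(result):
    """Return the (label, score) pair with the highest score."""
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])


predictions = {r["sequence"]: top_label(r) for r in results}
counts = Counter(label for label, _ in predictions.values())

for sequence, (label, score) in predictions.items():
    print(f"{sequence!r:25} -> {label} ({score:.2f})")
print(counts)
```

All six segments end up with 'incomplete sentence' as the top label, so the two labels do not separate the complete sentences from the fragments at all.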
Since the language model understands natural language, shouldn't it understand what a well-defined sentence is? Or must it always be trained/fine-tuned for that?
Do you have any suggestions? Other approaches?
Which transformers methods should I use?
Could other labels (in this zero-shot classification example) be more accurate?
Do I necessarily need a dataset to train/fine-tune?
Does anyone have suggestions for datasets of well-defined sentences (in Brazilian Portuguese)?
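Regarding the question about other labels: as far as I understand, the zero-shot pipeline turns each label into an NLI hypothesis through its `hypothesis_template` argument, whose default is "This example is {}.". Below is a sketch of how I could vary both the template and the labels. `CUSTOM_TEMPLATE`, `CANDIDATE_LABELS`, and the function names are hypothetical choices of mine, and I have no evidence yet that they give better results.

```python
MODEL_NAME = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"

DEFAULT_TEMPLATE = "This example is {}."  # default template of the zero-shot pipeline
CUSTOM_TEMPLATE = "This text is {}."      # hypothetical alternative, untested
CANDIDATE_LABELS = [                      # hypothetical labels, untested
    "a complete grammatical sentence",
    "a fragment or a title",
]


def build_hypotheses(template, labels):
    """Show the hypotheses the NLI model would see for each label."""
    return [template.format(label) for label in labels]


def classify(segments, template=CUSTOM_TEMPLATE, labels=CANDIDATE_LABELS):
    """Run zero-shot classification with a custom hypothesis template."""
    # Imported here so the helper above works without transformers installed.
    from transformers import pipeline

    pipe = pipeline("zero-shot-classification", model=MODEL_NAME)
    return pipe(segments, candidate_labels=list(labels), hypothesis_template=template)


original = build_hypotheses(DEFAULT_TEMPLATE, ["well defined sentence", "incomplete sentence"])
custom = build_hypotheses(CUSTOM_TEMPLATE, CANDIDATE_LABELS)
print(original)
print(custom)
```

With my original labels the hypotheses read "This example is well defined sentence." and "This example is incomplete sentence.", which are not well-formed themselves. Calling `classify(["My name is John.", "This is a"])` would run the model with the alternative wording (it downloads the model on first use).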
Thank you