Text classification of RSS articles

Hello!

I’m a software engineer with solid coding skills but limited knowledge of AI, and I’ve embarked on a simple project.

I have a large collection of RSS articles that I have read or liked; I consider these “interesting”. I also have about a gazillion unread articles. Some of those may be interesting too, but most are probably not, since I haven’t read them.
My goal is, for any new article, to compute a score of interesting-ness. This will help me quickly identify the articles worth reading.

The articles range in length from 400 to 4000 tokens. I have about 5000 read/liked articles. I was tempted to take about 5000 unread articles and label them not_important, label all my liked/read articles important, and then train a binary classifier, roughly as described in the Text classification tutorial on the Hugging Face website. I used distilbert/distilbert-base-uncased as in the tutorial and followed its steps almost exactly.
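For reference, the labeling step described above can be sketched roughly like this (the list inputs and the 80/20 split are illustrative assumptions; in practice you would feed the result into a Hugging Face `Dataset` before tokenizing):

```python
import random

def build_dataset(liked_articles, unread_articles, seed=42):
    """Pair each article text with a binary label:
    1 = important (read/liked), 0 = not_important (sampled unread)."""
    rng = random.Random(seed)
    # Sample as many unread articles as there are liked ones,
    # so the two classes stay balanced.
    k = min(len(liked_articles), len(unread_articles))
    negatives = rng.sample(unread_articles, k=k)
    examples = [{"text": t, "label": 1} for t in liked_articles]
    examples += [{"text": t, "label": 0} for t in negatives]
    rng.shuffle(examples)
    # Hold out 20% for evaluation.
    split = int(0.8 * len(examples))
    return examples[:split], examples[split:]

train, test = build_dataset(["liked article"] * 100, ["unread article"] * 1000)
```

One caveat this sketch makes visible: the negatives are only *presumed* uninteresting, so some label noise in the 0 class is baked in from the start.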

```
{'loss': 0.6051, 'grad_norm': 2.22690749168396, 'learning_rate': 6.162420382165605e-06, 'epoch': 1.59}
{'eval_loss': 0.5926874279975891, 'eval_accuracy': 0.6693258875149581, 'eval_runtime': 357.0262, 'eval_samples_per_second': 7.022, 'eval_steps_per_second': 0.221, 'epoch': 2.0}
{'train_runtime': 12047.1712, 'train_samples_per_second': 1.665, 'train_steps_per_second': 0.052, 'train_loss': 0.592256072220529, 'epoch': 2.0}
```

I got modest results after training (about 67% eval accuracy).

My question for this forum is: is this the right approach, and should I persevere? Should I put effort into building a better dataset (e.g., labeling my not_important articles more carefully), or is there a better approach?

For example, I have also considered using the model to compute embeddings of all the read/liked articles and then training a one-class classifier with a “traditional” algorithm like SVM, instead of a binary one.
The bottleneck to improving accuracy will be properly labeling the “not_important” articles; if there were a way to avoid doing that, it would be great :slight_smile:
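To illustrate the one-class idea without committing to SVM specifics, here is a minimal sketch that scores a new article by cosine similarity to the centroid of the liked-article embeddings. The embedding vectors are assumed to come from whatever encoder you use (e.g., a sentence-transformer); the point is that no negative labels are needed:

```python
import math

def centroid(vectors):
    """Mean of the liked-article embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def interestingness(article_vec, liked_vecs):
    """Score in [-1, 1]: similarity of a new article to the 'liked' centroid."""
    return cosine(article_vec, centroid(liked_vecs))

# Toy 2-d embeddings: liked articles cluster along the first axis.
liked = [[1.0, 0.1], [0.9, 0.0], [1.1, -0.1]]
print(interestingness([1.0, 0.0], liked))  # close to 1.0
```

A one-class SVM over the same embeddings is the more principled version of this; the centroid trick is just the cheapest possible baseline to sanity-check whether embeddings separate your taste at all.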

Please let me know what you think


Hello.

Given that it already works reasonably well in practice, I think the approach is sound. There are many successor models to BERT, so it should be possible to improve accuracy by using one of those.

Another approach that can be taken when there is little labeled data is Positive-Unlabeled (PU) Learning, which fits your situation exactly: confident positives (read/liked) plus a large unlabeled pool, with no need to hand-label negatives.

Another common approach is to use a commercial AI model to create a training dataset from your own data. This is almost always effective if the budget allows. In this case, however, a considerable amount of data is already available, so processing it with Python may be sufficient.
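As a sketch of the commercial-AI labeling idea: have an LLM judge a sample of unread articles so you get cleaner not_important labels. The `ask_llm` callable below is a hypothetical stand-in for whatever API you use, and the prompt and parsing are illustrative only:

```python
def label_article(text, ask_llm):
    """Ask an LLM whether an article is interesting.
    `ask_llm` is a hypothetical callable wrapping your chosen API."""
    prompt = (
        "You are labeling RSS articles for a reader whose interests are "
        "shown by their reading history. Answer with exactly one word, "
        "'important' or 'not_important', for this article:\n\n" + text[:4000]
    )
    answer = ask_llm(prompt).strip().lower()
    return 1 if answer == "important" else 0

# Usage with a stub in place of a real API call:
fake_llm = lambda prompt: "important"
label = label_article("A deep dive into Rust async runtimes", fake_llm)  # 1
```

In practice you would include a few examples of liked articles in the prompt so the model learns your taste rather than some generic notion of "interesting".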

Resources: