Hello!
I’m a software engineer with good coding skills but limited knowledge of AI. I have embarked on a simple project.
I have a large number of RSS articles that I have read or liked. I consider these “interesting”. I also have about a gazillion unread articles. These could be interesting, but most are probably uninteresting, since I haven’t read them.
My goal is, for any new article, to compute a score of interesting-ness. This will help me quickly identify the articles worth reading.
The articles range in length from 400 to 4000 tokens, and I have about 5000 read/liked articles. My first idea was to take about 5000 unread articles and label them not_important, label all my read/liked articles important, and then train a binary classifier, as described in the Text classification tutorial on the Hugging Face website. I used distilbert/distilbert-base-uncased, like in the tutorial, and followed its steps almost exactly.
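To make the setup concrete, here is a minimal sketch of the labeling scheme and of how a score could be read off the classifier. It assumes the articles are available as plain Python lists of strings; the names `build_dataset` and `interestingness` are hypothetical helpers, not part of the tutorial, and the score is simply the softmax probability of the important class computed from the two output logits.

```python
import math
import random

# Label names from the post; the integer ids are an arbitrary choice.
LABEL2ID = {"not_important": 0, "important": 1}

def build_dataset(liked, unread, n_negative, seed=0):
    """Label liked articles as important and a random sample of
    unread articles as not_important, then shuffle the rows."""
    rng = random.Random(seed)
    negatives = rng.sample(unread, min(n_negative, len(unread)))
    rows = [{"text": t, "label": LABEL2ID["important"]} for t in liked]
    rows += [{"text": t, "label": LABEL2ID["not_important"]} for t in negatives]
    rng.shuffle(rows)
    return rows

def interestingness(logits):
    """Softmax probability of the 'important' class from a pair of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[LABEL2ID["important"]] / sum(exps)
```

With this scheme, every sampled unread article counts as a negative, including any that I would actually have liked, so the not_important label is noisy by construction.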
{'loss': 0.6051, 'grad_norm': 2.22690749168396, 'learning_rate': 6.162420382165605e-06, 'epoch': 1.59}
{'eval_loss': 0.5926874279975891, 'eval_accuracy': 0.6693258875149581, 'eval_runtime': 357.0262, 'eval_samples_per_second': 7.022, 'eval_steps_per_second': 0.221, 'epoch': 2.0}
{'train_runtime': 12047.1712, 'train_samples_per_second': 1.665, 'train_steps_per_second': 0.052, 'train_loss': 0.592256072220529, 'epoch': 2.0}
I got modest results after training: an evaluation accuracy of about 67% after 2 epochs, as shown in the log output above.
My question for this forum is: is this the right approach, and should I persevere? Should I put some effort into getting a better dataset (for example, by labeling my not_important articles more carefully), or is there a better approach?
For example, I have also considered using the model to compute embeddings of all the read/liked articles and then using a “traditional” algorithm like an SVM to train a one-class classifier instead of a binary one.
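A rough sketch of what I have in mind for the one-class idea, using scikit-learn's `OneClassSVM`. The embeddings here are random placeholder vectors standing in for real article embeddings, and the dimension, `nu`, and kernel settings are assumptions chosen only for illustration, not tuned values.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import OneClassSVM

def fit_one_class(liked_embeddings, nu=0.1):
    """Fit a one-class SVM on embeddings of read/liked articles only."""
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
    model.fit(normalize(liked_embeddings))  # L2-normalize each row
    return model

def score_articles(model, embeddings):
    """Higher values mean 'more similar to the liked articles'."""
    return model.decision_function(normalize(embeddings))

# Placeholder data: NOT real embeddings, just synthetic vectors.
rng = np.random.default_rng(0)
liked = rng.normal(loc=1.0, scale=0.5, size=(200, 16))
similar = rng.normal(loc=1.0, scale=0.5, size=(5, 16))
different = rng.normal(loc=-3.0, scale=0.5, size=(5, 16))

model = fit_one_class(liked)
scores = score_articles(model, np.vstack([similar, different]))
```

The appeal is that only the read/liked articles are needed for training; the decision function value could then serve directly as the interesting-ness score for a new article.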
The bottleneck to improving the accuracy of the model will be properly labeling the “not_important” articles. If there were a way to get away with not doing that, that would be great.
Please let me know what you think.