NLP advice sought for news processing

Hi all,

I am working on a hobby project, trying to extend it with AI/ML features and came here to ask for advice.

Let me describe the project first: I do not like being online all the time, but I do like Twitter and want to know about hot topics. So I have built a website that reads Twitter, selects the main news, and presents it in a newspaper-like way.

For now, it simply lists tweets ordered by importance. Well, this is not how real newspapers look. Real newspapers have articles which group sentences (tweets) related to a particular topic.

I have tried calculating tweet embeddings using several pretrained huggingface models and evaluating tweet distance based on cosine similarity, but the results are far below what I need.
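For anyone wondering what this step looks like concretely, here is a minimal sketch of the cosine-similarity comparison. The toy 4-dimensional vectors stand in for real model output (a sentence-embedding model would produce vectors with hundreds of dimensions); only the comparison itself is shown.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" standing in for real model output.
tweet_a = np.array([0.9, 0.1, 0.0, 0.2])
tweet_b = np.array([0.8, 0.2, 0.1, 0.1])
tweet_c = np.array([0.0, 0.1, 0.9, 0.7])

print(cosine_similarity(tweet_a, tweet_b))  # high: vectors point the same way
print(cosine_similarity(tweet_a, tweet_c))  # low: vectors point elsewhere
```

The problem is not this computation but what the embeddings encode: general-purpose models place "similar wording" close together, which, as described below, is not the same as "belongs in the same article".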


There are several challenges that I am trying to solve.

1. News similarity differs from general similarity

The following sentences are not similar in general terms, but relate to the same topic and should be included in one article:

* Ukraine's defence minister 'optimistic' that war could end this year
* Street fighting continues in Severodonetsk, says Zelenskyy

On the other hand, there can be similar sentences which are not related at all from the newspaper’s perspective (especially if named entities are skipped!):

* VW should be worried about dieselgate
* Johnson should be worried about partygate

2. Similarity depends on context

In some contexts, all the below sentences could be a part of an article about Trump, while in others they could contribute to an article about corruption (a) and another one about Capitol attack (b and c).

a) Perspective: As Watergate’s 50th anniversary nears, Woodward and Bernstein write that they thought Nixon defined corruption. Then came Trump.
b) EXCLUSIVE: Rep. Tom Rice was one of 10 House Republicans who voted to impeach former Pres. Trump for inciting the Jan. 6
c) Capitol attack panel to unveil new evidence against Trump at public hearings

3. Unknown words should not be ignored

Many NLP algorithms rely on a static dictionary. When processing a new text, they skip all unknown words. As a result, locations, people’s names, and hashtags are not considered, and crucial information is lost.

I have tried using FastText, which builds word vectors from character n-grams (subwords), but the initial results were weak.

Ideally, I would like the algorithm to understand that “White House officials”, “U.S. president” and “Biden” are all related. But given the changing focus of news (think how much we have all improved at Ukrainian geography recently), this is probably a high bar.
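To illustrate why subword features help with unknown words: two spellings of the same city share many character trigrams even though neither may appear in a fixed dictionary. The sketch below just compares trigram sets directly (a Jaccard overlap); FastText actually learns embeddings over such n-grams rather than comparing them, so this is only meant to show the intuition.

```python
def char_ngrams(word: str, n: int = 3) -> set:
    """Character n-grams with boundary markers, FastText-style."""
    padded = f"<{word.lower()}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def subword_overlap(w1: str, w2: str) -> float:
    """Jaccard overlap of character trigram sets."""
    a, b = char_ngrams(w1), char_ngrams(w2)
    return len(a & b) / len(a | b)

print(subword_overlap("Severodonetsk", "Sieverodonetsk"))  # high overlap
print(subword_overlap("Severodonetsk", "partygate"))       # no overlap
```

So even a word the model has never seen gets a usable representation, instead of being dropped entirely.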

4. Adaptable

One of the approaches could be to classify news into predefined categories, such as local, foreign, sports, economy, or show business. This would allow me to use a standard classification approach.

However, the whole site is generic; besides the predefined channels it can be used for a person’s home Twitter timeline. Imagine a nerd who only follows technical updates on JavaScript development frameworks, or an NGO worker who only follows local news for a selected region. I want the website to be relevant for them as well.

5. Multi-language

The predefined news channels are in English, German and Czech. Ideally, I would like the website to be language independent. If the model used can be multilingual, that’s great. If not, there is an alternative approach – use an online translation service to translate any tweet to English first.

Steps I have tried so far

  • I have used several Hugging Face sentence-embedding models (namely all-MiniLM-L6-v2, paraphrase-multilingual-MiniLM-L12-v2 and paraphrase-xlm-r-multilingual-v1)
  • I have used several approaches from the gensim library (FastText, Latent Semantic Indexing, GloVe).

I have relied on existing models, but the results were not satisfactory, so it seems that I need to train (or customize) something myself. It is likely that I will have to do supervised training and start by collecting and labelling training data. In order to do this right, I need to decide on the direction now.

Ideas on how to continue

  1. I could train a model to translate each tweet into a list of topics. This list would typically contain locations, people’s names, hashtags, and also general terms such as war or soccer. I would then be looking for similarities across these keywords.
  2. I could label training data to capture tweet similarity as needed for news, i.e. each training item would be a tuple (tweet1, tweet2, similarity_score) with the score between 0 and 1. Then I could train a model to calculate this score for me. This approach means giving up on challenge 2 (context dependence) but could still work reasonably well.
  3. Both ideas above assume processing tweets one by one. Another approach could be feeding the model a whole set of tweets and expecting an answer such as [0, 0, 0, 1, 1, 0, 2, 2, 1, 0], meaning that tweets 4, 5 and 9 belong to one article, while 7 and 8 belong to another. This would make using the model very simple, but its complexity would be much higher and it would probably require a huge amount of training data.
  4. I could give up on the tweet text as such and only focus on its named entities. This would allow me to use existing NER models. Easy to implement, but probably limited performance.
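One note on idea 3: the grouped output format does not necessarily require an end-to-end model. Given any pairwise scorer (from ideas 1, 2 or 4), tweets can be linked whenever their score exceeds a threshold, and the connected components become articles. A sketch, with a deliberately naive word-overlap scorer standing in for a trained one:

```python
def cluster_tweets(tweets, similarity, threshold=0.5):
    """Connected-components clustering over a similarity threshold."""
    n = len(tweets)
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        stack = [i]
        while stack:  # flood-fill everything transitively linked to tweet i
            cur = stack.pop()
            for j in range(n):
                if labels[j] == -1 and similarity(tweets[cur], tweets[j]) >= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

# Toy scorer: word overlap (a trained pairwise model would replace this).
def word_overlap(t1, t2):
    a, b = set(t1.lower().split()), set(t2.lower().split())
    return len(a & b) / max(len(a | b), 1)

tweets = [
    "street fighting continues in severodonetsk",
    "fighting in severodonetsk intensifies",
    "vw should be worried about dieselgate",
]
print(cluster_tweets(tweets, word_overlap, threshold=0.3))  # → [0, 0, 1]
```

The transitive linking also softens the context problem a little: two tweets that never score highly against each other can still end up in the same article via a third tweet that bridges them.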

I am new to AI/ML so I will appreciate any hints or comments.