
The following sections explain all these tasks and the approach in detail. All the code for this solution is available in my GitHub repository.

Data Collection

It is often necessary to collect data for text analysis from internet resources. Scrapy is a popular tool for building web scrapers, and there are several interesting articles about using it to crawl news and related data. However, news data crawling is not the major focus here. This article uses a financial news headline dataset from Kaggle as an example to illustrate news clustering and trending story extraction. The dataset has more than 30k news headlines from 2018 to 2020, collected from three news sources: Reuters, The Guardian and CNBC. Each article contains the title, a short description and the publishing time.
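To ground the later steps, a minimal loading sketch is shown below. The file name and column names are assumptions about the Kaggle download, not taken from the repository.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual Kaggle download.
df = pd.read_csv("financial_news_headlines.csv")

print(len(df))              # expect 30k+ rows
print(df.columns.tolist())  # e.g., ['title', 'description', 'published_at', 'source']
```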
Text Cleaning

Online news often contains many unwanted texts: words from other languages, provider-specific patterns, etc. If those text features are not cleaned during the early stage of the pipeline, they cause noise in the downstream tasks. Text cleaning is often a domain- or problem-specific task. The majority of articles in this dataset are written in English, so the non-English characters are simply removed.

Provider-specific text patterns

A news provider may have some recurring patterns in its articles. For example, the Reuters news in this dataset has many articles with common patterns of phrases or entities like the following:

- Facebook’s Zuckerberg to testify before Congress: source
- McDonald’s accused of firing worker who sued over COVID-19 claims : Bloomberg
- Coty to appoint Chairman Peter Harf as its new CEO : WSJ
- Siemens prepares for COVID-19 trough to last 6–9 months : CNBC

If those pattern phrases are not removed, they may be recognized as keywords of the article, adding more noise to story clustering. In this article, those provider-related patterns are cleaned using regular expressions in Python, as sketched below.
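A minimal sketch of both cleaning steps follows, assuming a trailing-attribution style like the examples above; the exact patterns used in the repository may differ.

```python
import re

# Illustrative trailing-attribution patterns ("...: source", "... : WSJ");
# the real rule set is dataset-specific.
PROVIDER_PATTERN = re.compile(
    r"\s*:\s*(source|Reuters|Bloomberg|WSJ|CNBC)\s*$", re.IGNORECASE
)

def clean_headline(text: str) -> str:
    """Remove non-English characters and trailing provider attributions."""
    text = re.sub(r"[^\x00-\x7F]+", " ", text)  # drop non-ASCII characters
    text = PROVIDER_PATTERN.sub("", text)       # drop trailing attribution
    return text.strip()

print(clean_headline("Coty to appoint Chairman Peter Harf as its new CEO : WSJ"))
# -> 'Coty to appoint Chairman Peter Harf as its new CEO'
```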
Keywords Extraction

Keywords extraction is one of the major tasks in the pipeline. The objective is to select several keywords, with their term frequencies, that reflect the key information of the article. The keywords often come from the named entities and noun phrases in the article.

Spacy is an open-source Python library capable of most NLP applications. Using its pre-trained models, it supports fast NER (Named Entity Recognition), covering most entity types such as Persons, Organizations and GPE (countries, cities, states), and it also supports the extraction of noun phrases. Instead of spending much effort on training a deep learning model, this article uses Spacy as the base tool to extract the entities and noun chunks efficiently.
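As a quick illustration of this extraction step (a sketch, not the repository code; the pipeline name en_core_web_sm and the example sentence are my choices):

```python
import spacy

# Small English pipeline; install it first with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Facebook's Zuckerberg to testify before Congress")

# Named entities with their labels (PERSON, ORG, GPE, ...)
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Base noun phrases
noun_chunks = [chunk.text for chunk in doc.noun_chunks]

print(entities)     # e.g., [('Zuckerberg', 'PERSON'), ('Congress', 'ORG')]
print(noun_chunks)  # e.g., ["Facebook's Zuckerberg", 'Congress']
```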
To improve the performance of keyword extraction, the following modules are presented:

Keywords Scoring

Considering the importance of a keyword in the news, different weights are used depending on the keyword type (Entity or Noun Chunk), the keyword location (title or content) and the number of times it appears. A keyword appearing in the title is assigned more weight than one in the content, and an Entity keyword weighs more than a Noun Chunk in the same location. A simple formula for the score of a keyword sums the product of these two weights over all its appearances:

score(keyword) = Σ location_weight × type_weight, over all appearances of the keyword

Keywords Filtering

As machine learning models are not perfect, there is also a small number of misclassification errors in the Spacy NLP model. The objective of keywords filtering is to remove unwanted words, misclassified entities and symbols, such as stop words, date-times, month names, prepositions, adjectives, determiners, conjunctions, punctuation, emails and special symbols (#, ^, *, etc.). In addition, keywords related to news providers' names and writing patterns are also removed, e.g., Reuters, Thomson Reuters, CNBC, source, story, Reuters story, etc.

Entity Linking

The objective of this step is to link entities with their alternative names or abbreviations, which improves the keyword similarity calculation in the next stage. For this 30k+ article dataset, about 250k keywords are extracted in total. The most popular keywords are shown below: the top 100 keywords account for about 24% of all extracted keywords, while the top 500 account for about 40%. As shown in the figure, the highlighted keywords in the same colour refer to the same entity. There are Python libraries such as BLINK that verify the extracted entities against Wikipedia data for general-purpose applications. For simplicity, this article builds an entity linking table by a quick check of those top keywords. Subsequently, a simple lookup table is created to link those top keywords to their alternative names or abbreviations; a toy version is sketched below.
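To make the entity linking step concrete, here is a toy lookup table of the kind described above; the aliases are illustrative examples, not entries from the actual table.

```python
# Toy entity-linking table: alternative names/abbreviations -> canonical name.
# The real table is built by inspecting the top keywords of the dataset.
ENTITY_ALIASES = {
    "fed": "Federal Reserve",
    "u.s. federal reserve": "Federal Reserve",
    "wto": "World Trade Organization",
}

def link_entity(keyword: str) -> str:
    """Map a keyword to its canonical entity name, if known."""
    return ENTITY_ALIASES.get(keyword.lower(), keyword)

print(link_entity("Fed"))  # -> 'Federal Reserve'
```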
The keyword extraction in this approach runs quite fast. On a PC with a 2.5GHz CPU and 8GB of RAM, it took about 50 minutes to process all 30k+ news articles, or less than 0.1s per article on average. With the weighted keywords extracted for all articles, the next step is to cluster the news into stories.