Anomaly detection for News: Finding new insights while combating misinformation
– Unfortunately, I was also unsuccessful in the latest Google DNI round. For the record… here is the project outline:
Project title: Anomaly detection for News: Finding new insights while combating misinformation
From a news standpoint, anomalies are extremely interesting because unexpected events, patterns and trends are very newsworthy in the best case or a sign of error and manipulation in the worst case.
Anomaly detection is the identification of rare items, events or observations which should raise suspicion (and curiosity) by differing significantly from the majority of the data. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions. [according to Wikipedia]
Implementing anomaly detection for news would help find new insights while combating misinformation.
What is unexpected is interesting. Automatic anomaly detection can help identify the norm in order to flag the unexpected. The proposed tool works on the language level but, most importantly, on the information content, i.e. on facts hidden in deeper layers of story text.
It builds on my DNI project from round 2 (Structured News: Atomizing the news into a browsable knowledge base for structured journalism) to extract and transform all kinds of information contained in news text into normalized, highly structured data that intelligent algorithms are able to utilize.
The goal would be to monitor and track fast-paced news feeds and news wires for outliers, that is, changes in the underlying phenomenon that are out of the ordinary but could not be observed without comparison against a larger body of background information that only automation can process.
The prototype would start with data types I have already worked with during the aforementioned DNI project. Reacting to the following types of changes would be my first and primary focus:
- outlier analysis for mentions of frequent terms and named entities
- unexpected statements in quoted speech
- changes in perception, i.e. opinion and sentiment
- movement in numerical data hidden inside story text
- anomaly detection for news events
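To make the first item on this list a little more concrete, here is a minimal sketch of outlier analysis for term mentions. It is an illustration under my own assumptions and not the planned implementation: the function names `mention_zscore` and `flag_outliers`, the z-score approach and the threshold of 3 standard deviations are all hypothetical choices, and the counts are made-up toy numbers.

```python
import math
from statistics import mean, stdev


def mention_zscore(history, today):
    """Return how many sample standard deviations today's count
    lies from the mean of the historical daily counts."""
    if len(history) < 2:
        raise ValueError("need at least two historical counts")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly constant history: any change is infinitely surprising.
        return 0.0 if today == mu else math.copysign(math.inf, today - mu)
    return (today - mu) / sigma


def flag_outliers(baseline, current, threshold=3.0):
    """Compare today's mention counts against a baseline of past counts.

    baseline: dict mapping term -> list of past daily mention counts
    current:  dict mapping term -> today's mention count
    Returns a dict of term -> z-score for terms beyond the threshold.
    (Hypothetical helper; the threshold is an assumed default.)
    """
    flagged = {}
    for term, today in current.items():
        history = baseline.get(term)
        if not history or len(history) < 2:
            continue  # not enough background data to judge this term
        z = mention_zscore(history, today)
        if abs(z) >= threshold:
            flagged[term] = z
    return flagged


# Toy data: mentions per day for two terms over five past days.
baseline = {"berlin": [10, 12, 11, 9, 10], "quake": [0, 1, 0, 0, 1]}
current = {"berlin": 11, "quake": 25}
flagged = flag_outliers(baseline, current)
print(sorted(flagged))
```

With these toy numbers only "quake" should be flagged, because "berlin" stays within its usual range. A real system would probably need something more robust than a plain z-score (seasonality, bursty counts), so treat this purely as a starting point.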
Anomaly detection could be an assistant tool for the newsroom and the individual journalist. It could be a new factor in ranking news and prioritizing effort. It could be of great value for generating novel ideas for news stories and data journalism. It could be a further component in the anti-fake-news campaign.
What makes your project innovative?
Anomaly detection is well-known for time series data, e.g. fluctuating numbers over time, like stock prices. It is also known for log file analysis, e.g. network security. I have not found many examples that take natural text as raw input. Applying the ideas to the news context seems to be a relatively unexplored and therefore worthwhile idea.
To make the tool as useful as possible, it has to examine and compare story text at a much deeper level than is usually possible. It has to look beyond the language level into the actual information content. But this problem was solved for the most part during my former DNI project (Structured News: Atomizing the news into a browsable knowledge base for structured journalism), so I can build on that.
The tool will learn from history but also use external sources like Wikipedia for background checks. To establish a kind of fact database I will use a number of natural language processing and machine learning approaches I'm already familiar with, and I'd also like to try some newer deep learning techniques that seem very promising.
My goal is to deliver a service that analyzes an existing article collection as a baseline to compare new incoming stories against. The tool would flag and evaluate the aforementioned changes in the underlying data. An additional user interface would allow users to define criteria in the categories they are particularly interested in.
The basis for the project will be 12 years' worth of web news crawl data collected by my service rivva.de.
How will your Project support and stimulate innovation in digital news journalism? Why does it have an impact?
This project adds a new dimension to identifying novel information buried in a news stream while at the same time flagging information that is probably false, by looking much deeper into an article's text than is usually feasible.
It should be a great addition to the repertoire of tools at the fingertips of journalists and newsrooms. It cuts and filters through news feeds, news wires and individual articles in a very special manner. It uses information-theoretic measures to quantify the interestingness of news and uncover the most unexpected information.
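One simple information-theoretic measure of this kind is surprisal: the less probable a term is given the background collection, the more bits of information its appearance carries. The sketch below is only an assumed example of such a measure, not the measure the project would necessarily use; the function name `surprisal_bits`, the add-one smoothing and the toy counts are my own illustrative choices.

```python
import math
from collections import Counter


def surprisal_bits(term, baseline_counts):
    """Return -log2 p(term), with p estimated from baseline counts.

    Add-one (Laplace) smoothing reserves one extra vocabulary slot
    for unseen terms, so a never-seen term gets a finite, high score.
    (Hypothetical helper for illustration.)
    """
    total = sum(baseline_counts.values())
    vocab_size = len(baseline_counts) + 1  # +1 slot for "unseen"
    p = (baseline_counts.get(term, 0) + 1) / (total + vocab_size)
    return -math.log2(p)


# Toy background collection: how often each term appeared so far.
background = Counter({"election": 6, "merger": 1})

for term in ("election", "merger", "earthquake"):
    print(term, round(surprisal_bits(term, background), 2))
```

Frequent terms score low and unseen terms score highest, which gives a rough ranking of "interestingness". The same idea could in principle be applied to extracted facts or entity pairs instead of single terms.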
Therefore, it should be a great source for new insights, story ideas and data journalism projects. Of course it should also help flag human error or consumer manipulation.
Because the tool looks for unexpected events, patterns and trends, it plays an important role as a monitoring and alerting system. That is what anomaly detection has traditionally been used for. So, integration with other systems used for this purpose (e.g. a dashboard, Slack agent or Twitter bot) could be worthwhile.
I have found only one research paper, from 2014 (http://www.aclweb.org/anthology/C14-1134), that takes part of the suggested approach. My project is much more interested in movement in data buried in news texts, in forms and at a magnitude where only automation can handle the flow of new information.