My structured news experiment receives funding from the Google DNI Innovation Fund

I am delighted to announce that my project proposal has indeed been selected for the second round of the Digital News Initiative (DNI). For anyone interested, I am sharing the first part of my application in advance.

Project title: Structured News: Atomizing the news into a browsable knowledge base for structured journalism

Brief overview:

Structured news explores a new space in news presentation and consumption.

The idea is that news events be broken down into their component pieces and organized into a coherent repository. As tiny, contextualized bits of information, these news "atoms" and "particles" become findable, investigable and recombinable as individual units.

This in turn makes personal media feasible: a story can be customized for each reader depending on her device, time budget and information needs, effectively answering the unbundling of news.

For the news ecosystem as a whole, structured data could become a new building block enabling a Lego-like flexibility in the newsroom.

Project description:

This proposal takes much inspiration from the Living Stories project, led by Google, the New York Times and the Washington Post, and builds upon their approach to structured journalism.

A living story can be thought of as a continuously updated news resource that is capable of reacting to multifaceted story developments given varying information preferences. This is made possible by treating journalistic content as structured data and structured data as journalistic content.

By "atomizing the news" we will be transforming a news archive into a fine-grained web of journalistic assets to be repurposed in different and new contexts. Technically a number of algorithms will split our text corpus into small, semantic chunks, be they a name, location, date, numeric fact, citation or some such concept. These "atomic news particles" will then get identified, refined and put into optimal storage format, involving tasks such as information extraction, named entity recognition and resolution.

For the seminal living stories experiment, all stories had to be labeled by hand. This prototype project, in contrast, will try the automation route. Ideally, the two approaches would be blended into a hybrid form with more editorial input.

The key deliverable will be the structured journalism repository accumulated over time, with all information organized around the people, places, organizations, products etc. named within news stories; facts about these entities; relationships between them; and their roles with respect to news events and story developments.
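
One possible, purely hypothetical shape for such repository records, sketched in Python; the field names are illustrative, not a finished schema:

    # Hypothetical repository record: an entity with facts and typed relations
    # linking it to other entities, stories and events.
    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        id: str                                        # e.g. "person:angela-merkel"
        kind: str                                      # person | place | organization | product ...
        name: str
        facts: dict = field(default_factory=dict)      # attribute -> value, with provenance
        relations: list = field(default_factory=list)  # (predicate, entity_id, story_id) triples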

To make this knowledge base easily browsable, I propose a faceted search user interface. Faceted search allows users to explore a multi-dimensional information space by combining search queries with multiple filters to drill down along various dimensions, making it an optimal navigation vehicle for our purposes.
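
As a toy illustration of the drill-down idea (in-memory Python, standing in for whatever search backend the prototype ends up using):

    # Sketch of faceted drill-down over atomized stories (illustrative only).
    def facet_search(particles, **filters):
        """Filter news atoms by any combination of facets, e.g. type, story, date."""
        return [p for p in particles
                if all(p.get(key) == value for key, value in filters.items())]

    # e.g. all citations from one story (the facet names are assumptions):
    # facet_search(repository, type="QUOTE", story="eurokrise")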

Specific outcome:

On the publishers' side, the proposed infrastructure would help build up newsroom memory, maximize the shelf life of content and provide the ultimate building blocks for novel news offerings and experiments. It must be emphasized that any news business created out of structured data is virtually safe from content theft because its user experience cannot be replicated without also copying the entire underlying database.

On the consumers' side, through structured journalism today's push model of news effectively turns into more of a pull, on-demand model. Up-to-date information is increasingly sought out exactly when it is needed and in just the right detail, not necessarily when it's freshly published nor in a one-size-fits-all news package. Essentially this implies transferring control over content from publishers to consumers. Product innovation on the users' behalf would be completely decoupled from innovation and experimentation in the newsroom.

Broader impact:

For news consumers I could see two major implications in user experience, honoring the readers' time and tending to their curiosity:

Today, readers who have been following the news are confronted with lots of redundant context duplicated across articles, whereas new readers are given too little background. In the programming community, we have a famous acronym: DRY! It stands for "don't repeat yourself" and is stated as "every piece of knowledge must have a single, unambiguous, authoritative representation within a system." DRY represents one of the core principles that make our code readable, reusable and maintainable. Applied to journalism, I have high hopes that it might reap the same benefits.

The second implication I would call "just-in-time information". It means that information is pulled, not pushed, so that the reader can decide for herself how to consume the content. Choosing just the highlights or just the updates? Following a specific event or topic? Slicing and dicing through the whole news archive? It all requires more structure. Atomized news organizes information around structure.

As for a broader impact on the news ecosystem, I could see more ideas from integrated software development environments being applied to the news editing process:

For instance, for several decades source code was looked at as a mere blob of characters. Only in the last 15 years have our source code editors started parsing our program code as we type, understanding the meaning in a string of tokens and giving direct feedback. Journalism shares the same major raw material, namely text, so the same potential lies in journalism tools. Just imagine what would happen if we stopped squeezing articles into just a few database columns and instead saved as much information about a story as we like. Would increased modularity in reporting bring the same qualities to journalism that developers value so much in code, like reuse, refactoring, versioning and possibly even open source? I would hope that the approach of structured news will inspire more explorations in these directions.

What makes your project innovative?

This project will supply a prototype infrastructure for structured journalism.

Because I am not a content provider myself, this project would be transformative for me if I could become a technology provider in this field. My goal is to classify approximately three million web pages, archived since 2007 by my own web crawler, into an ever richer network of structured stories. This repository could then serve as a playground to evaluate the ideas described here.

Advanced natural language understanding will be most crucial to the problem. This project would help me familiarize myself further with state-of-the-art deep learning models like word vector and paragraph vector representations as well as long short-term memory (LSTM) neural networks.
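
As a small taste of what word vector representations involve, a sketch with gensim's Word2Vec (gensim 4.x API; the toy sentences stand in for real tokenized articles):

    # Illustrative only: training word vectors on a (toy) news corpus with gensim.
    from gensim.models import Word2Vec

    sentences = [["merkel", "spricht", "in", "berlin"],
                 ["schäuble", "verhandelt", "in", "brüssel"]]  # stand-in corpus
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
    print(model.wv.most_similar("berlin", topn=3))  # nearest words in vector space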

The technology built for this project will mainly comprise a streaming data processing pipeline for several natural language processing and machine learning tasks, including information extraction, entity recognition and entity resolution.
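
A minimal sketch of that streaming idea, with each stage as a Python generator; the stand-in recognizer and resolver below are placeholders for the real NLP components:

    # Each stage consumes and yields documents one at a time, so the archive
    # can be processed as a stream rather than loaded into memory at once.
    def extract(pages):
        for page in pages:
            yield {"url": page["url"], "text": page["body"]}  # boilerplate removal elided

    def recognize(docs):
        for doc in docs:
            doc["entities"] = [w for w in doc["text"].split() if w.istitle()]  # stand-in NER
            yield doc

    def resolve(docs):
        ids = {}
        for doc in docs:
            # real entity resolution would map "Merkel" and "Angela Merkel" to one id
            doc["entity_ids"] = [ids.setdefault(e, f"entity:{len(ids)}") for e in doc["entities"]]
            yield doc

    pages = [{"url": "https://example.org/1", "body": "Angela Merkel spricht in Berlin"}]
    for doc in resolve(recognize(extract(pages))):
        print(doc)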

Key deliverables will be the structured journalism repository and the faceted news browser already outlined in the project description.

It's essential that this structured news browser be intuitive and useful to readers without their having to master advanced search options. Different levels of detail cater to readers with different levels of interest, so ideally the final product should feel like a very flexible Wikipedia article. Imagine a single page with a story summary and a stream of highlighted updates, where all content is organized and filterable by named people, places, events and so on. Every piece of content is weighted by importance, and the page remembers what you have already read and which topics you are really interested in.

Although international teams are experimenting on the very same frontier, the German language poses some unique problems in text mining and therefore justifies parallel efforts.

How will your Project support and stimulate innovation in digital news journalism? Why does it have an impact?

Because of its Lego-like flexibility, structured data is the ultimate building block, enabling quick experimentation and novel news products.

It establishes a new kind of marketplace ripe for intensive collaboration and teamwork. Jeff Jarvis' rule "Do what you do best and link to the rest" could set a supply chain in motion in which more reuse, syndication and communication take place, both within a single news organization and across the industry. Just imagine a shared news repository taking inspiration from the open source culture of development communities like GitHub.

A shared knowledge base of fact-checked microcontent would initially make topic pages and info boxes more efficient, thereby maximizing the investment in today's news archives. More generally, structure acts as an enabling technology on countless other fronts, be it personalization, diversification and summarization on the readers' behalf, or process journalism, data journalism and robot journalism for the profession.

I am looking for a new sponsor

After more than three years of support from the Süddeutsche Zeitung, I am looking for a new partner as of the middle of the year.

If you are interested (perhaps also in independent projects), please get in touch at frank.westphal@gmail.com

Finally, I would like to thank Stefan Plöchinger and his team for the great collaboration.

Living Stories (continued)

Having applied to the Digital News Initiative Innovation Fund without success, I'm posting my project proposal here in the hope of reaching a wider audience. If you are interested in atomized news and structured journalism and would like to exchange ideas and implementation patterns, please send me an email.

Project title: Living Stories (continued)

Brief overview:

With this proposal, I'd like to follow up on the Living Stories project, led by Google, the New York Times and the Washington Post, and build upon its approach to structured journalism.

A living story can be thought of as a continuously updated news resource that is capable of reacting to multifaceted story developments given varying information preferences. It's like a Wikipedia where each and every word knows exactly whether it is a name, place, date, numeric fact, citation or some such concept. This "atomization of news" breaks a corpus of articles down into a fine-grained web of journalistic assets to be repurposed in different and new contexts. This in turn makes personal media feasible: a story can be customized for each reader depending on her device, time budget and information needs, effectively answering the unbundling of news.

Combining the latest natural language processing and machine learning algorithms, I'd love to build the technical infrastructure to automate these tasks. My proof of concept would turn nine years' worth of crawled web data into a rich network of living stories. If successful, microservice APIs will be offered for paid and public use.

Detailed description:

Living stories explore a new space in news presentation and consumption.

To refresh our memories of what a living story actually was, I'll quickly summarize: it's a single-page web app, with a story summary and a stream of updates, where all content is organized and filterable by named people, places, events, and so on. Different levels of detail cater to readers with different levels of interest, so every piece of content is weighted by importance and the page remembers what you have already read.

I'd like to highlight just two outcomes: (i) the DRY principle ("don't repeat yourself") says to honor the readers' time, and (ii) just-in-time information says to tend to the readers' curiosity.

Today, readers who have been following the news are confronted with lots of redundant context duplicated across articles, whereas new readers are given too little background. In the programming community, we have a famous acronym: DRY! It stands for "don't repeat yourself" and is stated as "every piece of knowledge must have a single, unambiguous, authoritative representation within a system." DRY represents one of the core principles that make code readable, reusable and maintainable. Applied to journalism, it might reap the same benefits.

The second idea is called just-in-time information. It means that information is pulled, not pushed, so that the reader can decide for herself how to consume the content. Choosing just the highlights or just the updates, or following a specific event or topic, or slicing and dicing through the whole news archive, requires structure. Living stories organize the information around structure.

What makes your project innovative?

In many ways, this project merely applies principles of modern software development, plus ideas from Toyota's lean production system, to the value stream of news organizations.

While both disciplines work with text as their major raw material, we don't yet share the same powerful tools and processes. For example, why do news articles get squeezed into just a few database fields (headline, text, author, timestamp and little else) when we could imagine so many more attributes for each story? What will happen if we stop handling articles as mere blobs of characters and parse them like source code instead? Would increased modularity in reporting bring the same qualities to journalism that developers value so much in code, like reuse, refactoring, versioning, and possibly even open source?
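
To illustrate what "more than a few database fields" could mean, here is a hypothetical story record as structured data; every field name is an assumption for illustration:

    # A story as structured data rather than a blob (hypothetical fields).
    story = {
        "headline": "...",
        "author": "...",
        "published": "2016-02-01T09:00:00Z",
        "summary": "...",
        "updates": [],        # stream of timestamped developments
        "people": [],         # named entities, resolved to ids
        "places": [],
        "organizations": [],
        "quotes": [],         # who said what, when, with source
        "numbers": [],        # numeric facts with units and provenance
        "events": [],         # ids of events this story belongs to
        "importance": 0.0,    # weighting for different levels of detail
    }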

For the seminal living stories experiment held in 2010, all data seems to have been crafted by hand, a librarian's job. This project, however, will apply computer science to the task. Ideally, the two approaches would be blended into a hybrid form with more editorial input.

The technology built for this project will include a streaming data processing pipeline for information extraction, entity recognition and entity resolution. Advanced natural language understanding will be most crucial to the problem, which is why I'd love to gain more experience with state-of-the-art deep learning models like recurrent, recursive, convolutional and especially long short-term memory (LSTM) neural networks, as well as word vector and paragraph vector representations.

My goal is to classify approximately three million web pages, archived since 2007 by Rivva's web crawler, into living stories. Deliverables will include a RESTful hypermedia API in which there is a URL for everything and its relations, browsable by humans as well as machine-readable. The APIs of internally used microservices will also be released, so that developers can build their own applications.
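
A hypothetical sketch of that hypermedia shape, using Flask for illustration; the URL scheme is an assumption, not the finished API design:

    # Every story gets a URL, and responses link onward to its relations.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/stories/<story_id>")
    def story(story_id):
        return jsonify({
            "id": story_id,
            "links": {                                   # hypermedia: follow your nose
                "people": f"/stories/{story_id}/people",
                "updates": f"/stories/{story_id}/updates",
            },
        })

    if __name__ == "__main__":
        app.run()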

On the publishers' side, the proposed technology stack would help build up newsroom memory, maximize the shelf life of content, and provide the ultimate building blocks for novel news offerings and experiments. It must be emphasized that any news business created out of structured data is virtually safe from content theft, because its user experience cannot be replicated without also copying the entire underlying database.

On the consumers' side, through structured journalism, today's push model of news effectively turns into more of a pull, on-demand model. Up-to-date information is increasingly sought out exactly when it is needed and in just the right detail, not necessarily when it's freshly published nor in a one-size-fits-all news package. Essentially, this implies transferring control over content from publishers to consumers. Product innovation on the users' behalf would be completely decoupled from innovation and experimentation in the newsroom.

Competition:

Adrian Holovaty's work on chicagocrime.org is the first example I remember of combining data, code and journalism in an innovative way.

Truly pathbreaking was the Living Stories effort by Google Labs, the New York Times and the Washington Post. It's unclear to me why this cutting-edge approach was discontinued so soon, and why in the meantime it hasn't even been taken up by someone else.

Circa News was regarded as a front-runner in "atomized news" but shut down this year due to lack of funding. Circa was breaking out of the traditional article format and branching out into an update stream in which facts, statistics, quotes and images represented the atomic level of each story.

PolitiFact is another good demonstration of structured news; its fact-checking of day-to-day claims made in US politics won it the Pulitzer Prize in 2009.

On the extreme end of the spectrum is Structured Stories. This approach is so highly structured, and thus requires so much manual labour, that I personally can't see how it would scale to the work pace inside newsrooms.

Recently, the BBC, the New York Times, the Boston Globe, the Washington Post, and possibly even more news labs have all announced experimental prototypes as well as new projects on the way, with the BBC being the most prolific (Ontology, Linked Data) and the New York Times being the most innovative (Editor, Particles).
