Living Stories (continued)

Having applied to the Digital News Initiative Innovation Fund without success, I'm posting my project proposal here in the hope of reaching a wider audience. If you are interested in atomized news and structured journalism and would like to exchange ideas and implementation patterns, please send me an email.

Project title: Living Stories (continued)

Brief overview:

With this proposal, I'd like to follow up on the Living Stories project, led by Google, the New York Times and the Washington Post, and build upon its approach to structured journalism.

A living story can be thought of as a continuously updated news resource that is capable of reacting to multifaceted story developments under varying information preferences. It's like a Wikipedia where each and every word knows exactly whether it is a name, a place, a date, a numeric fact, a citation or some other such concept. This "atomization of news" breaks a corpus of articles down into a fine-grained web of journalistic assets to be repurposed in different and new contexts. This in turn makes personal media feasible: a story can be customized for each reader depending on her device, time budget and information needs, effectively answering the unbundling of news.
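As a rough illustration of what such atomization could look like in data, here is a minimal sketch. All type names, fields and the importance scale are my own invention for this example, not part of any existing system:

```python
from dataclasses import dataclass, field

@dataclass
class Atom:
    """One fine-grained journalistic asset extracted from an article."""
    kind: str        # hypothetical types: "person", "place", "date", "fact", "quote"
    text: str        # the asset's surface text
    importance: int  # editorial weight: 1 = background detail, 5 = headline material
    sources: list = field(default_factory=list)  # URLs of articles the asset appears in

@dataclass
class LivingStory:
    """A continuously updated story assembled from atoms."""
    title: str
    atoms: list = field(default_factory=list)

    def customize(self, min_importance: int) -> list:
        """Personalize for a reader's time budget: more time, lower threshold."""
        return [a for a in self.atoms if a.importance >= min_importance]
```

A reader with only a minute to spare might get `customize(4)` (headlines and key facts), while a reader catching up in depth gets `customize(1)` (everything, background included).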

Combining the latest natural language processing and machine learning algorithms, I'd love to build the technical infrastructure to automate these tasks. My proof of concept would turn nine years' worth of crawled web data into a rich network of living stories. If successful, microservice APIs will be offered for paid and public use.

Detailed description:

Living stories are exploring a new space in news presentation and consumption.

To refresh our memories of what a living story actually was, I'll quickly summarize: it's a single-page web app with a story summary and a stream of updates, where all content is organized and filterable by named people, places, events, and so on. Different levels of detail cater to readers with different levels of interest, so every piece of content is weighted by importance, and the app remembers what you have already read.

I'd like to highlight just two outcomes: (i) the DRY principle ("don't repeat yourself"), which says to honor the reader's time, and (ii) just-in-time information, which says to tend to the reader's curiosity.

Today, readers who have been following the news are confronted with lots of redundant context duplicated across articles, whereas new readers are given too little background. In the programming community, we have a famous acronym: DRY! It stands for "don't repeat yourself" and states that "every piece of knowledge must have a single, unambiguous, authoritative representation within a system." DRY is one of the core principles that make code readable, reusable and maintainable. Applied to journalism, it might reap the same benefits.

The second idea is called just-in-time information. It means that information is pulled, not pushed, so that the reader can decide for herself how to consume the content. Choosing just the highlights or just the updates, or following a specific event or topic, or slicing and dicing through the whole news archive, requires structure. Living stories organize the information around structure.
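To make "pulled, not pushed" concrete, here is a toy sketch of a reader-driven query over an update stream. The update records and reader state below are hypothetical, invented purely for illustration:

```python
def pull_updates(updates, topic=None, already_read=None):
    """Return only what the reader asks for: optionally a single topic,
    always skipping what she has already read (just-in-time, not a firehose)."""
    already_read = already_read or set()
    result = []
    for u in updates:
        if u["id"] in already_read:
            continue  # honor the reader's time: never repeat consumed content
        if topic is not None and topic not in u["topics"]:
            continue  # honor the reader's curiosity: only the chosen slice
        result.append(u)
    return result
```

The same structured stream can thus serve a highlights view, an updates-only view, or a deep dive into one event, without the newsroom publishing three different packages.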

What makes your project innovative?

In many ways, this project is merely applying the principles of modern software development, plus ideas from Toyota's lean production, to the value stream of news organizations.

While both disciplines work with text as their major raw material, we don't yet share the same powerful tools and processes. For example, why do news articles get squeezed into just a few database fields (e.g. headline, text, author, timestamp) when we could imagine so many more attributes for each story? What would happen if we stopped handling articles as mere blobs of characters and parsed them like source code? Would increased modularity in reporting bring the same qualities to journalism that developers value so much in code, like reuse, refactoring, versioning, and possibly even open source?

For the seminal Living Stories experiment in 2010, all data seems to have been crafted by hand, a librarian's job. This project, however, will apply computer science to the task. Ideally, the two approaches would be blended into a hybrid form with more editorial input.

The technology built for this project will include a streaming data processing pipeline for information extraction, recognition and resolution. Advanced natural language understanding will be the most crucial part of the problem, which is why I'd love to gain more experience with state-of-the-art deep learning models: recurrent, recursive and convolutional neural networks, especially long short-term memory (LSTM) networks, as well as word vector and paragraph vector representations.
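In place of the deep learning models themselves, here is a toy rendering of the three pipeline stages, extraction, recognition and resolution, using a crude capitalization heuristic as a stand-in for a real NER model. Everything here is a simplification for illustration:

```python
import re

def extract(html: str) -> str:
    """Stage 1: strip markup, keep the raw text (a real pipeline would
    also remove boilerplate like navigation and ads)."""
    return re.sub(r"<[^>]+>", " ", html)

def recognize(text: str) -> list:
    """Stage 2: find candidate named entities.
    Toy heuristic: runs of capitalized words; a real system would use
    a trained sequence model (e.g. an LSTM tagger) instead."""
    return re.findall(r"(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)", text)

def resolve(mentions: list, known: dict) -> list:
    """Stage 3: map surface mentions to canonical entity IDs (entity linking),
    so 'Merkel', 'Angela Merkel' etc. can converge on one record."""
    return [known.get(m, m) for m in mentions]
```

Chained together over a crawled page, these stages would yield the entity links that turn a flat article into nodes of a living story.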

My goal is to classify approximately three million web pages, archived since 2007 by Rivva's web crawler, into living stories. Deliverables will include a RESTful hypermedia API in which there is a URL for everything and its relations, browsable by humans as well as machine-readable. The APIs of the internally used microservices will also be released, so that developers can build their own applications.
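To sketch what "a URL for everything and its relations" might look like in a response body, here is a minimal rendering loosely following the HAL convention of a `_links` section. The URL scheme and field names are invented for this example:

```python
import json

def render_story(story_id: str, title: str, entity_ids: list) -> str:
    """Render a story resource whose relations are all navigable links,
    so both humans and machines can follow them."""
    doc = {
        "title": title,
        "_links": {
            "self": {"href": f"/stories/{story_id}"},
            "updates": {"href": f"/stories/{story_id}/updates"},
            "entities": [{"href": f"/entities/{e}"} for e in entity_ids],
        },
    }
    return json.dumps(doc, indent=2)
```

A client never constructs URLs itself; it discovers them in the document, which keeps the API browsable and evolvable.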

On the publishers' side, the proposed technology stack would help build up newsroom memory, maximize the shelf life of content, and provide the ultimate building blocks for novel news offerings and experiments. It must be emphasized that any news business created out of structured data is virtually safe from content theft, because its user experience cannot be replicated without also copying the entire underlying database.

On the consumers' side, through structured journalism, today's push model of news effectively turns into more of a pull, on-demand model. Up-to-date information is increasingly sought out exactly when it is needed and in just the right detail, not necessarily when it's freshly published nor in a one-size-fits-all news package. Essentially, this implies transferring control over content from publishers to consumers. Product innovation on the users' behalf would be completely decoupled from innovation and experimentation in the newsroom.


Adrian Holovaty's work on chicagocrime.org is the first example I remember that combined data, code and journalism in an innovative way.

Truly pathbreaking was the Living Stories effort by Google Labs, the New York Times and the Washington Post. It's unclear to me why its cutting-edge approach was discontinued so soon, or why it has not since been taken up by anyone else.

Circa News was regarded as a front-runner in "atomized news" but shut down this year due to lack of funding. Circa broke out of the traditional article format, branching out into an update stream in which facts, statistics, quotes and images represented the atomic level of each story.

PolitiFact is another good demonstration of structured news; its fact-checking of day-to-day claims in US politics won it the Pulitzer Prize in 2009.

On the extreme end of the spectrum is Structured Stories. This approach is so highly structured, and thus demands so much manual labour, that I personally can't see how it would scale to the work pace inside newsrooms.

Recently, the BBC, the New York Times, the Boston Globe, the Washington Post, and possibly more news labs besides, have all announced experimental prototypes as well as new projects on the way, with the BBC being the most prolific (Ontology, Linked Data) and the New York Times the most innovative (Editor, Particles).


