My Structured News experiment receives funding from the Google DNI Innovation Fund
I am delighted to announce that my project proposal has indeed been selected for the second round of the Digital News Initiative (DNI). For anyone interested, here is the first part of my application in advance.
Project title: Structured News: Atomizing the news into a browsable knowledge base for structured journalism
Structured news explores a new space in news presentation and consumption.
It proposes that all news events be broken down into their component pieces and organized into a coherent repository. As tiny, contextualized bits of information, these news "atoms" and "particles" will be findable, investigable and recombinable as individual units.
This in turn makes personal media feasible, where a story can be customized for each reader depending on her device, time budget and information needs, effectively answering the unbundling of news.
For the news ecosystem as a whole, structured data could become a new building block enabling a Lego-like flexibility in the newsroom.
This proposal takes much inspiration from the Living Stories project, led by Google, the New York Times and the Washington Post, and builds upon their approach to structured journalism.
A living story can be thought of as a continuously updated news resource that is capable of reacting to multifaceted story developments given varying information preferences. This is made possible by treating journalistic content as structured data and structured data as journalistic content.
By "atomizing the news" we will be transforming a news archive into a fine-grained web of journalistic assets to be repurposed in different and new contexts. Technically, a number of algorithms will split our text corpus into small, semantic chunks, be they a name, location, date, numeric fact, citation or some such concept. These "atomic news particles" will then be identified, refined and put into an optimal storage format, involving tasks such as information extraction and named entity recognition and resolution.
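As a rough illustration of the splitting step described above, the following Python sketch pulls a few kinds of "atoms" out of raw text. The regular expressions, the sample sentence and the names `Atom` and `atomize` are assumptions made for this example only; a real pipeline would rely on trained information extraction and named entity recognition models rather than hand-written patterns.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    kind: str   # e.g. "date", "citation", "number"
    text: str   # the surface string found in the article
    start: int  # character offset in the source text


# Toy patterns standing in for real extraction models (illustrative assumptions).
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "citation": re.compile(r'"[^"]+"'),
    "number": re.compile(r"\b\d+(?:\.\d+)?\s?(?:%|percent|million|billion)"),
}


def atomize(text):
    """Return all atoms found in the text, ordered by position."""
    atoms = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            atoms.append(Atom(kind, match.group(0), match.start()))
    return sorted(atoms, key=lambda atom: atom.start)


# Made-up sample sentence, not taken from any real article.
sample = 'On 2015-03-01 the mayor said "we will rebuild" and pledged 12 million.'
for atom in atomize(sample):
    print(atom.kind, atom.text)
```

Keeping the character offset with each atom makes it possible to link a particle back to the exact place in the article it came from.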
For the seminal Living Stories experiment, all stories had to be labeled by hand. This prototype project, in contrast, will try the automation route. Ideally these approaches would be blended into a hybrid form with more editorial input.
The key deliverable will be the structured journalism repository accumulated over time, with all information organized around the people, places, organizations, products etc. named within news stories, facts about these entities, relationships between them and their role with respect to news events and story developments.
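One way to picture such a repository is as a small data model with entities, facts about them, relations between them and roles in events. The sketch below is only an in-memory illustration under that assumption; the class names, fields and sample records are hypothetical and say nothing about the storage format the project will finally use.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str
    kind: str   # "person", "place", "organization", "product", ...
    name: str
    facts: dict = field(default_factory=dict)  # facts about this entity


@dataclass(frozen=True)
class Relation:
    subject: str    # entity id
    predicate: str  # e.g. "member_of"
    object: str     # entity id


@dataclass(frozen=True)
class Role:
    entity: str  # entity id
    event: str   # event id
    role: str    # the part the entity plays in that event


class Repository:
    def __init__(self):
        self.entities = {}
        self.relations = []
        self.roles = []

    def add(self, entity):
        self.entities[entity.id] = entity

    def relate(self, subject, predicate, obj):
        self.relations.append(Relation(subject, predicate, obj))

    def assign_role(self, entity, event, role):
        self.roles.append(Role(entity, event, role))

    def events_for(self, entity_id):
        """All events in which the entity plays some role."""
        return sorted({r.event for r in self.roles if r.entity == entity_id})

    def neighbours(self, entity_id):
        """Outgoing relations of an entity as (predicate, object id) pairs."""
        return [(r.predicate, r.object) for r in self.relations if r.subject == entity_id]


# Placeholder records, invented for the example.
repo = Repository()
repo.add(Entity("p1", "person", "Jane Doe", {"occupation": "mayor"}))
repo.add(Entity("o1", "organization", "Example City Council"))
repo.relate("p1", "member_of", "o1")
repo.assign_role("p1", "bridge-vote", "proponent")
print(repo.events_for("p1"), repo.neighbours("p1"))
```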
To make this knowledge base easily browsable I'd like to propose a faceted search user interface. Faceted search allows users to explore a multi-dimensional information space by combining search queries with multiple filters to drill down along various dimensions and is therefore an optimal navigation vehicle for all our purposes.
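To make the idea of faceted search more concrete, here is a minimal sketch that combines a text query with filters and reports how many hits fall under each facet value. The article records, field names and the functions `search` and `facet_counts` are invented for this illustration; a production system would more likely sit on top of a dedicated search engine.

```python
from collections import Counter

# Invented sample records; "person", "place" and "year" act as facets.
ARTICLES = [
    {"title": "Council approves new bridge", "person": "Jane Doe", "place": "Hamburg", "year": 2015},
    {"title": "Bridge construction delayed", "person": "John Roe", "place": "Hamburg", "year": 2016},
    {"title": "New bridge opens in Berlin", "person": "Jane Doe", "place": "Berlin", "year": 2016},
    {"title": "School budget debate", "person": "John Roe", "place": "Berlin", "year": 2016},
]


def search(docs, query="", **filters):
    """Keep documents whose title contains the query and that match every filter."""
    needle = query.lower()
    return [
        doc for doc in docs
        if needle in doc["title"].lower()
        and all(doc.get(key) == value for key, value in filters.items())
    ]


def facet_counts(docs, facet):
    """Count how many documents carry each value of the given facet."""
    return Counter(doc[facet] for doc in docs if facet in doc)


hits = search(ARTICLES, "bridge")
print(facet_counts(hits, "place"))   # options for the next drill-down step
narrowed = search(ARTICLES, "bridge", place="Hamburg", year=2016)
print([doc["title"] for doc in narrowed])
```

The facet counts are what allow a user interface to show the remaining choices next to each filter, so readers can drill down without typing advanced search syntax.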
On the publishers' side, the proposed infrastructure would help build up newsroom memory, maximize the shelf life of content and provide the ultimate building blocks for novel news offerings and experiments. It must be emphasized that any news business created out of structured data is virtually safe from content theft because its user experience cannot be replicated without also copying the entire underlying database.
On the consumers' side, through structured journalism today's push model of news effectively turns into more of a pull, on-demand model. Up-to-date information is increasingly sought out exactly when it is needed and in just the right detail, not necessarily when it's freshly published nor in a one-size-fits-all news package. Essentially this implies transferring control over content from publishers to consumers. Product innovation on the users' behalf would be completely decoupled from innovation and experimentation in the newsroom.
For news consumers I could see two major implications in user experience, honoring the readers' time and tending to their curiosity:
Today readers who have been following the news are confronted with lots of redundant context duplicated across articles, whereas new readers are given too little background. In the programming community we have a famous acronym: DRY! It stands for "don't repeat yourself" and is stated as "every piece of knowledge must have a single, unambiguous, authoritative representation within a system." DRY represents one of the core principles that make our code readable, reusable and maintainable. Applied to journalism, I have high hopes it might reap the same benefits.
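A small sketch may show what DRY could mean for an article: background is stored once and referenced by key, and the renderer leaves out whatever a reader has already seen. The keys, texts and the `render` function are hypothetical and only serve this example.

```python
# Each piece of background exists exactly once (invented sample content).
BACKGROUND = {
    "bridge-history": "The bridge was first proposed in 2010.",
    "bridge-funding": "Funding comes from the city budget.",
}

# The article references its context by key instead of repeating the text.
ARTICLE = {
    "context": ["bridge-history", "bridge-funding"],
    "update": "The council approved the final design today.",
}


def render(article, background, already_read=frozenset()):
    """Assemble the article, skipping background the reader already knows."""
    parts = [
        background[key]
        for key in article["context"]
        if key not in already_read
    ]
    parts.append(article["update"])
    return " ".join(parts)


print(render(ARTICLE, BACKGROUND))                                      # new reader
print(render(ARTICLE, BACKGROUND, {"bridge-history", "bridge-funding"}))  # returning reader
```

Correcting a background fact then means editing one entry, and every article that references it picks up the change.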
The second implication I would call "just-in-time information". It means that information is pulled, not pushed, so that the reader can decide for herself how to consume the content. Choosing just the highlights or just the updates? Or following a specific event or topic? Or slicing and dicing through the whole news archive? It all requires more structure. Atomized news organizes the information around structure.
As for a broader impact on the news ecosystem, I could see more ideas from integrated software development environments being applied to the news editing process:
For instance, for several decades source code was merely looked at as a blob of characters. Only in the last 15 years have our source code editors started parsing our program code while we type, understanding the meaning in a string of tokens and giving direct feedback. We share the same major raw material, namely text, so the same potential lies in journalism tools. Just imagine what will happen if we stop squeezing articles into just a few database columns and instead save as much information about a story as we like. Would increased modularity in reporting bring the same qualities to journalism that developers value so much in code, like reuse, refactoring, versioning and possibly even open source? I would hope that the approach of structured news will inspire more explorations in these directions.
What makes your project innovative?
This project will supply a prototype infrastructure for structured journalism.
Because I am not a content provider myself, this project would be transformative to me if I could become a technology provider in the respective field. My goal is to classify approximately three million web pages, archived since 2007 by my own web crawler, into an ever richer network of structured stories. This repository could then establish a playground to evaluate the ideas described and implied.
Advanced natural language understanding will be most crucial to the problem. This project would help me familiarize myself more with state-of-the-art deep learning models like word vector and paragraph vector representations as well as long short-term memory (LSTM) neural networks.
The technology built for this project will mainly include a streaming data processing pipeline for several natural language processing and machine learning tasks, including information extraction and named entity recognition and resolution.
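The shape of such a streaming pipeline can be hinted at with Python generators, where each stage consumes and yields records one at a time instead of loading the whole archive. Everything here is an assumption for illustration: the stage names, the tiny alias table and the sample pages. Real recognition and resolution would use trained models instead of a lookup table.

```python
import re

# Hypothetical alias table mapping surface forms to canonical entity names.
ALIASES = {
    "google": "Google",
    "new york times": "The New York Times",
    "nyt": "The New York Times",
    "washington post": "The Washington Post",
}


def recognize(pages):
    """Stage 1: yield (page_id, alias) for every known alias found in a page."""
    for page_id, text in pages:
        lowered = text.lower()
        for alias in ALIASES:
            # Word boundaries avoid false hits such as "nyt" inside "anything".
            if re.search(r"\b" + re.escape(alias) + r"\b", lowered):
                yield page_id, alias


def resolve(mentions):
    """Stage 2: map each alias to its canonical entity name."""
    for page_id, alias in mentions:
        yield page_id, ALIASES[alias]


def pipeline(pages):
    """Chain the stages lazily; nothing runs until the result is iterated."""
    return resolve(recognize(pages))


# Invented sample pages.
pages = [
    ("p1", "Google and the NYT led the project."),
    ("p2", "Anything else?"),
]
for page_id, entity in pipeline(pages):
    print(page_id, entity)
```

Because the stages are lazy, further steps such as fact extraction or storage could be appended without changing the earlier ones.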
Key deliverables will be the structured journalism repository and faceted news browser mentioned before and in the project description.
It's essential that this structured news browser be intuitive and useful to readers without requiring them to master advanced search options. Different levels of detail cater to readers with different levels of interest. So ideally the final product should remind users of a very flexible Wikipedia article. Imagine a single page with a story summary and a stream of highlighted updates. All content is organized and filterable by named people, places, events and so on. Every piece of content is weighted by importance, and the page remembers what you have already read and which topics you are really interested in.
Although international teams are experimenting on the very same frontier, the German language poses some unique problems in text mining and therefore justifies overlapping efforts.
How will your Project support and stimulate innovation in digital news journalism? Why does it have an impact?
Because of its Lego-like flexibility, structured data is the ultimate building block, enabling quick experimentation and novel news products.
It establishes a new kind of marketplace ripe for intensive collaboration and teamwork. Jeff Jarvis's rule "Do what you do best and link to the rest" could put a supply chain in motion in which more reuse, syndication and communication take place both within a single news organization and across the industry. Just imagine a shared news repository taking inspiration from the open source culture in development communities like GitHub.
A shared knowledge base of fact-checked microcontent initially would result in topic pages and info boxes being more efficient, therefore maximizing the investments in today's news archives. Similarly, structure acts as an enabling technology on countless other fronts, be it personalization, diversification, summarization on readers' behalf or process journalism, data journalism, robot journalism for the profession.