
My structured news experiment receives funding from the Google DNI Innovation Fund

I'm delighted to announce that my project proposal has indeed been selected for the second round of the Digital News Initiative (DNI). For anyone interested, I'm sharing the first part of my application here in advance.

Project title: Structured News: Atomizing the news into a browsable knowledge base for structured journalism

Brief overview:

Structured news explores a new space in news presentation and consumption.

It proposes that all news events be broken down into their component pieces and organized into a coherent repository. As tiny, contextualized bits of information, these news "atoms" and "particles" will become findable, investigable and recombinable as individual units.

This in turn makes personal media feasible: a story can be customized for each reader depending on her device, time budget and information needs, effectively answering the unbundling of news.

For the news ecosystem as a whole, structured data could become a new building block enabling a Lego-like flexibility in the newsroom.

Project description:

This proposal takes much inspiration from the Living Stories project, led by Google, the New York Times and the Washington Post, and builds upon their approach to structured journalism.

A living story can be thought of as a continuously updated news resource that is capable of reacting to multifaceted story developments given varying information preferences. This is made possible by treating journalistic content as structured data and structured data as journalistic content.

By "atomizing the news" we will be transforming a news archive into a fine-grained web of journalistic assets to be repurposed in different and new contexts. Technically a number of algorithms will split our text corpus into small, semantic chunks, be they a name, location, date, numeric fact, citation or some such concept. These "atomic news particles" will then get identified, refined and put into optimal storage format, involving tasks such as information extraction, named entity recognition and resolution.

For the seminal living stories experiment, all stories had to be labeled by hand. This prototype project, in contrast, will try the automation route. Ideally, both approaches would be blended into a hybrid form with more editorial input.

The key deliverable will be the structured journalism repository accumulated over time, with all information organized around the people, places, organizations, products etc. named within news stories, facts about these entities, relationships between them, and their roles with respect to news events and story developments.

To make this knowledge base easily browsable, I'd like to propose a faceted search user interface. Faceted search lets users explore a multi-dimensional information space by combining search queries with multiple filters to drill down along various dimensions, making it an ideal navigation vehicle for our purposes.
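
As a toy illustration of the principle (not the production design), faceted drill-down boils down to membership tests over the particle index; the story data layout here is hypothetical:

    # Toy sketch of faceted search: each keyword argument narrows the
    # result set along one dimension. The story data layout is made up.
    def facet_search(stories, **filters):
        return [story for story in stories
                if all(value in story.get(facet, set())
                       for facet, value in filters.items())]

    # e.g. facet_search(index, person="Angela Merkel", place="Berlin")
    # combines a person facet and a place facet into one drill-down.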

Specific outcome:

On the publishers' side, the proposed infrastructure would help build up newsroom memory, maximize the shelf life of content and provide the ultimate building blocks for novel news offerings and experiments. It must be emphasized that any news business created out of structured data is virtually safe from content theft because its user experience cannot be replicated without also copying the entire underlying database.

On the consumers' side, through structured journalism, today's push model of news effectively turns into more of a pull, on-demand model. Up-to-date information is increasingly sought out exactly when it is needed and in just the right detail, not necessarily when it's freshly published nor in a one-size-fits-all news package. Essentially, this implies transferring control over content from publishers to consumers. Product innovation on the users' behalf would be completely decoupled from innovation and experimentation in the newsroom.

Broader impact:

For news consumers, I could see two major implications for the user experience, honoring the readers' time and tending to their curiosity:

Today readers who have been following the news are confronted with lots of redundant context duplicated across articles, whereas new readers are given too little background. In the programming community we have a famous acronym: DRY! It stands for "don't repeat yourself" and is stated as "every piece of knowledge must have a single, unambiguous, authoritative representation within a system." DRY represents one of the core principles that make our code readable, reusable and maintainable. Applied to journalism, I have high hopes that it might reap the same benefits.

The second implication I would call "just-in-time information". It means that information is pulled, not pushed, so that the reader can decide for herself how to consume the content. Choosing just the highlights or just the updates? Following a specific event or topic? Slicing and dicing through the whole news archive? All of it requires more structure. Atomized news organizes information around structure.

As for a broader impact on the news ecosystem, I could see more ideas from integrated software development environments being applied to the news editing process:

For instance, for several decades source code was looked at as a mere blob of characters. Only in the last 15 years have our source code editors started parsing program code as we type, understanding the meaning in a string of tokens and giving direct feedback. We share the same major raw material, namely text, so the same potential lies in journalism tools. Just imagine what will happen if we stop squeezing articles into just a few database columns and instead save as much information about a story as we like. Would increased modularity in reporting bring the same qualities to journalism that developers value so much in code, like reuse, refactoring, versioning and possibly even open source? I would hope that the approach of structured news will inspire more explorations in these directions.

What makes your project innovative?

This project will supply a prototype infrastructure for structured journalism.

Because I am not a content provider myself, this project would be transformative for me if I could become a technology provider in the respective field. My goal is to classify approximately three million web pages, archived since 2007 by my own web crawler, into an ever richer network of structured stories. This repository could then serve as a playground to evaluate the ideas described here and their implications.

Advanced natural language understanding will be most crucial to the problem. This project would help me familiarize myself further with state-of-the-art deep learning models like word vector and paragraph vector representations as well as long short-term memory (LSTM) neural networks.

The technology built for this project will mainly comprise a streaming data processing pipeline for several natural language processing and machine learning tasks, such as information extraction, named entity recognition and entity resolution.

Key deliverables will be the structured journalism repository and faceted news browser mentioned before and in the project description.

It's essential that this structured news browser be intuitive and useful to readers without their having to master advanced search options. Different levels of detail cater to readers with different levels of interest. Ideally, the final product should remind users somewhat of a very flexible Wikipedia article. Imagine a single page with a story summary and a stream of highlighted updates. All content is organized and filterable by named people, places, events and so on. Every piece of content is weighted by importance and remembers what you have already read and which topics you are really interested in.

Although international teams are experimenting on this very frontier, the German language poses some unique problems in text mining and therefore justifies a parallel effort.

How will your Project support and stimulate innovation in digital news journalism? Why does it have an impact?

Because of its Lego-like flexibility, structured data is the ultimate building block, enabling quick experimentation and novel news products.

It establishes a new kind of marketplace, ripe for intensive collaboration and teamwork. Jeff Jarvis' rule "Do your best, link to the rest" could set a supply chain in motion in which more reuse, syndication and communication take place, both within a single news organization and across the industry. Just imagine a shared news repository taking inspiration from the open source culture of development communities like GitHub.

A shared knowledge base of fact-checked microcontent would initially make topic pages and info boxes more efficient to produce, thereby maximizing the investments in today's news archives. Similarly, structure acts as an enabling technology on countless other fronts, be it personalization, diversification and summarization on the readers' behalf, or process journalism, data journalism and robot journalism for the profession.


I'm looking for a new sponsor

After 3+ years of support from the Süddeutsche Zeitung, I'm looking for a new partner starting in the middle of the year.

If you're interested (perhaps also in independent projects), please get in touch at frank.westphal@gmail.com

Finally, I'd like to thank Stefan Plöchinger and his team for the great collaboration.


Living Stories (continued)

Having applied to the Digital News Initiative Innovation Fund without success, I'm posting my project proposal here in the hope of reaching a wider audience. If you are interested in atomized news and structured journalism and would like to exchange ideas and implementation patterns, please send me an email.

Project title: Living Stories (continued)

Brief overview:

With this proposal, I'd like to follow up on the Living Stories project, led by Google, the New York Times and the Washington Post, and build upon its approach to structured journalism.

A living story can be thought of as a continuously updated news resource that is capable of reacting to multifaceted story developments given varying information preferences. It's like a Wikipedia where each and every word knows exactly whether it is a name, place, date, numeric fact, citation or some similar concept. This "atomization of news" breaks a corpus of articles down into a fine-grained web of journalistic assets that can be repurposed in different and new contexts. This in turn makes personal media feasible: a story can be customized for each reader depending on her device, time budget and information needs, effectively answering the unbundling of news.

Combining the latest natural language processing and machine learning algorithms, I'd love to build the technical infrastructure to automate these tasks. My proof of concept would turn nine years' worth of crawled web data into a rich network of living stories. If successful, microservice APIs will be offered for paid and public use.

Detailed description:

Living stories are exploring a new space in news presentation and consumption.

To refresh our memories about what a living story actually was, I'll quickly summarize: it's a single-page web app, with a story summary and a stream of updates, where all content is organized and filterable by named people, places, events, and so on. Different levels of detail cater to readers with different levels of interest, so every piece of content is weighted by importance and remembers what you have already read.

I'd like to highlight just two outcomes: (i) the DRY principle ("don't repeat yourself") says to honor the readers' time, and (ii) just-in-time information says to tend to the readers' curiosity.

Today, readers who have been following the news are confronted with lots of redundant context duplicated across articles, whereas new readers are given too little background. In the programming community, we have a famous acronym: DRY! It stands for "don't repeat yourself" and is stated as "every piece of knowledge must have a single, unambiguous, authoritative representation within a system." DRY represents one of the core principles that makes code readable, reusable and maintainable. Applied to journalism, it might reap the same benefits.

The second idea is called just-in-time information. It means that information is pulled, not pushed, so that the reader can decide for herself how to consume the content. Choosing just the highlights or just the updates, or following a specific event or topic, or slicing and dicing through the whole news archive, requires structure. Living stories organize the information around structure.

What makes your project innovative?

In many ways, this project merely applies principles of modern software development, plus ideas from Toyota's lean production, to the value stream of news organizations.

While both disciplines work with text as their major raw material, we don't share the same powerful tools and processes yet. For example, why do news articles get squeezed into just a few database fields (headline, text, author, timestamp, and little else) when we could imagine so many more attributes for each story? What will happen if we stop handling articles as mere blobs of characters and parse them like source code? Would increased modularity in reporting bring the same qualities to journalism that developers value so much in code, like reuse, refactoring, versioning, and possibly even open source?
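
To sketch what "so many more attributes" could mean, here is a hypothetical story record; every field beyond the familiar four is my assumption, not an existing schema:

    # Hypothetical story schema: the familiar four fields plus the kind
    # of parsed structure this proposal argues for. Names are illustrative.
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class Story:
        headline: str
        text: str
        author: str
        published: datetime
        # Parsed structure, analogous to what an IDE extracts from code:
        people: list = field(default_factory=list)
        places: list = field(default_factory=list)
        quotes: list = field(default_factory=list)
        numeric_facts: list = field(default_factory=list)
        updates: list = field(default_factory=list)  # follow-up story ids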

For the seminal living stories experiment held in 2010, all data seems to have been crafted by hand, a librarian's job. This project, however, will apply computer science to the task. Ideally, both approaches would be blended into a hybrid form with more editorial input.

The technology built for this project will include a streaming data processing pipeline for information extraction, recognition and resolution. Advanced natural language understanding will be most crucial to the problem, which is why I'd love to gain more experience with state-of-the-art deep learning models like recurrent, recursive, convolutional and especially long short-term memory (LSTM) neural networks, as well as word vector and paragraph vector representations.
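
As one concrete and well-known instance of the word-vector idea, embeddings could be trained on the tokenized archive with gensim's word2vec; the corpus variable and all hyperparameters below are placeholders:

    # Sketch: train word embeddings on the archive with gensim's word2vec.
    # `tokenized_articles` (a list of token lists, one per article) and
    # the hyperparameters are placeholders.
    from gensim.models import Word2Vec

    model = Word2Vec(sentences=tokenized_articles,
                     vector_size=100, window=5, min_count=5, workers=4)

    # Nearby vectors hint at related entities and concepts:
    print(model.wv.most_similar("Bundestag", topn=5))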

My goal is to classify approximately three million web pages, archived since 2007 by Rivva's web crawler, into living stories. Deliverables will include a RESTful hypermedia API with a URL for everything and its relations, browsable by humans and machine-readable alike. The APIs of internally used microservices will also be released, so that developers can build their own applications.
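
A minimal sketch of the hypermedia idea, assuming Flask as the web layer; the route and link names are hypothetical:

    # Sketch of "a URL for everything and its relations": each resource
    # links to its related resources. Routes and fields are hypothetical.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/stories/<story_id>")
    def story(story_id):
        return jsonify({
            "id": story_id,
            "_links": {
                "self": f"/stories/{story_id}",
                "people": f"/stories/{story_id}/people",
                "updates": f"/stories/{story_id}/updates",
            },
        })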

On the publishers' side, the proposed technology stack would help build up newsroom memory, maximize the shelf life of content, and provide the ultimate building blocks for novel news offerings and experiments. It must be emphasized that any news business created out of structured data is virtually safe from content theft, because its user experience cannot be replicated without also copying the entire underlying database.

On the consumers' side, through structured journalism, today's push model of news effectively turns into more of a pull, on-demand model. Up-to-date information is increasingly sought out exactly when it is needed and in just the right detail, not necessarily when it's freshly published nor in a one-size-fits-all news package. Essentially, this implies transferring control over content from publishers to consumers. Product innovation on the users' behalf would be completely decoupled from innovation and experimentation in the newsroom.

Competition:

Adrian Holovaty's work on chicagocrime.org is the first example I remember that combined data, code and journalism in an innovative way.

Truly pathbreaking was the Living Stories effort by Google Labs, the New York Times and the Washington Post. It's unclear to me why its cutting-edge approach was discontinued so soon, or why no one else has taken it up in the meantime.

Circa News was regarded as a front-runner in "atomized news" but shut down this year for lack of funding. Circa broke out of the traditional article format, branching into an update stream of facts, statistics, quotes and images that represented the atomic level of each story.

PolitiFact is another good demonstration of structured news; its fact-checking of day-to-day claims in US politics won it the Pulitzer Prize in 2009.

On the extreme end of the spectrum is Structured Stories. This approach is so highly structured, and thus demands so much manual labour, that I personally can't see how it would scale to the work pace inside newsrooms.

Recently, the BBC, the New York Times, the Boston Globe, the Washington Post, and possibly even more news labs, all have announced both experimental prototypes as well as new projects on the way, with the BBC being the most prolific (Ontology, Linked Data) and the New York Times being the most innovative (Editor, Particles).


Right to be forgotten

Rivva now also supports the right to be forgotten, initially only for Twitter and Facebook content, later for blogs and news sites as well.

Whether last year's ECJ ruling would literally apply to services like Rivva is, so far, an open question. Rivva therefore interprets the requirements as follows:

Users who delete their content generally do not wish it to remain published elsewhere either.

Previously, I had received five deletion requests and subsequently removed two links (violation of personality rights) as well as three individual tweets (deleted posts) from the archive.

A few weeks ago, the great deletion began, and since then about 625,000* tweets have disappeared. (* Only tweets that Rivva would otherwise have liked to preserve in its archive are counted here. Beyond that, the bot forgets countless past tweets every night anyway.)

Removed content and associated reasons
Year of origin   Tweet deleted   Tweet protected   Account deleted   Account suspended
2009                     5,185             1,587             1,276                 256
2010                     8,528             3,165             2,796                 492
2011                    24,392            13,248            15,968               1,070
2012                    66,005            31,163            45,293               3,292
2013                    78,132            29,916            51,270               1,954
2014                    81,869            30,328            77,909               1,660
2015                    26,986            11,293            10,578                  62

Data: daten.rivva.de/v1.0/transparenz/rivva-recht-auf-vergessen-twitter.csv

In general, the deletion rule is that newer content is purged from the database practically immediately, while older content is removed with a linearly growing delay.
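
Expressed as code, the rule might look like this minimal sketch; the slope is an assumed tuning parameter, not a documented one:

    # Sketch of the stated deletion rule: new content is purged at once,
    # older content after a delay that grows linearly with its age.
    # The slope value is an assumption, not a documented parameter.
    from datetime import timedelta

    def purge_delay(age: timedelta, slope: float = 0.1) -> timedelta:
        return age * slope  # e.g. a five-year-old tweet waits ~6 months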

A new transparency page is meant to bundle relevant and up-to-date information on the matter.

In the long run, I would like to use statistical methods there to show the combined effect that laws, guidelines and my own worldview have on this project.

All good?


Debate monitor for Süddeutsche.de

"Ihre SZ" is venturing an experiment to rethink the dialogue with its readers.

In order to concentrate its debate forum on three big topics a day going forward, Süddeutsche.de has removed the classic comment function and now links to Rivva beneath all of its texts for your Facebook comments and Twitter replies.

Thanks once again to Stefan Plöchinger and his team.

rivva.de/sz.de/social provides a continuously updated overview of the most-discussed articles on SZ.de.


»Everything as a Service«

Once upon a time there was one Rails project, then there were three; today Rivva is composed of 21 apps.

What began more than two years ago out of necessity, mercilessly partitioning the strongly grown codebase into manageable components, modules and APIs again, has since come a long way. I'm far from done, but it looks very good.

Ruby/Rails upgrades are now much easier. The system is easier to distribute and monitor. Functions become flexibly composable again and wonderfully easy to integrate from other languages. You get an API, a lib, a CLI for practically everything. Microservices, in short.

There was little time for new things, however. Migrating large parts in the smallest possible steps while the system keeps running unfortunately takes forever.

It's crystal clear to me that Rivva limps here and there. The good old hyperlink, still dearly beloved by Rivva, is rapidly losing importance, I suspect. Somehow, somewhere, someday Rivva will have to let go of its ideals and adapt to the realities of the net.

While we're at it, clearing the decks: comments here were closed for a long time because the last blog post is almost a year old. So feel free to post below whatever currently annoys you or is missing.

torial also conducted a short interview about the current state of things.


New SZ.de news scanner integrates rivva.de

Süddeutsche.de's social media press review now also includes blogs, under "Leser empfehlen > in Blogs".

Many thanks to Stefan Plöchinger, Daniel Schumacher and team.


»Everything is a Stream«

$ sudo rails new rivva

Over the years, Rivva has played a lot with news design. What Google's image search alone unearths is great fun to browse through. Much of it was indeed just playing around, but a few ideas always stick.

Today Rivva presents itself in a fresh coat of paint once again. It's the biggest change of scenery since the project began.

Rivers of News

Readers from day one may be pleased: "news is a river" is "back to the future".

The new homepage is now divided into many rivers:

  1. Top stories of the day
  2. Timeline of new articles
  3. Popular articles by topic

New articles first appear in the timeline, from where they can either swim up into the top stories or eventually flow into the individual topic channels below.

Everything is a river. Even further downstream, new currents can form.

At least it's a start.

And the site is device-responsive, mobile-first.


Article recommendations in the WiWo iPad app

We have now also integrated the recommendation service "Mehr zum Thema im Netz" into WirtschaftsWoche on the iPad.

Thanks go to Thomas Stölzel and Thomas Dingler.


Rivva and the Leistungsschutzrecht (2)

On August 1, the Leistungsschutzrecht (Germany's ancillary copyright for press publishers) takes effect. In this post I want to answer what consequences Rivva is drawing from the law.

No Snippet

Part (1) already explained why the teaser texts unfortunately have to go. Rivva is therefore letting the snippets fade a little more each day, until by the first they are no longer legible at all.

Opt-in

A number of publishers have meanwhile declared, on their own behalf, that they will not make use of the LSR: that linking to their publications, including short text excerpts, remains welcome, requires no prior permission and will not be billed.

Some big names are missing, however, and will be sorely missed on rivva.de in the future.

Around 650 local newspapers, magazines and their blogs will no longer appear in the aggregation, given the current legal uncertainty.

It's sad. The bureaucratic effort of asking every interesting source individually for permission is beyond a one-person project. What's missing is a machine-readable standard.

Sunsetting

The inevitable is catching up with Rivva Search and Social.

 

Thanks to everyone who fought.
