Topic Browsing Cablegate

What is this?

We are experimenting with automated topic modelling to generate browsable indexes for large, unstructured collections of documents. We use the WikiLeaks Cablegate corpus as a collection of topical interest.

Our browsers require an up-to-date WebKit browser: Google Chrome or a recent Safari WebKit nightly build.

Browsers

Wikileaks Cable Browser

Try the Cablegate Topic Browser.
Now indexing 6317 cables.

This browser lets you explore from topics to documents and back to topics, and vice versa. Starting from a document of interest, you can easily find related documents; starting from a given topic, you can find other related topics.

The aim is to make it tractable to browse this corpus of over 6,000 cables, containing over 3 million words.

Topic maps and Document maps

An earlier experimental browser is still available.

In the document map, we link each document to its closest neighbour. Clicking on a document takes you to that cable on the WikiLeaks site.
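The page does not say how "closest" is measured. As a minimal sketch, assuming each document is summarised by its LDA topic proportions and that closeness means cosine similarity between those proportions, a document map could be built as follows. The cable IDs, the numbers, and the function names are made up for illustration.

```python
import math

# Hypothetical topic proportions for four cables (IDs and values are invented).
doc_topics = {
    "cable-A": [0.8, 0.1, 0.1],
    "cable-B": [0.7, 0.2, 0.1],
    "cable-C": [0.1, 0.1, 0.8],
    "cable-D": [0.2, 0.1, 0.7],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_neighbour(doc_id, doc_topics):
    """Return the other document whose topic proportions are most similar."""
    others = (other for other in doc_topics if other != doc_id)
    return max(others, key=lambda other: cosine(doc_topics[doc_id], doc_topics[other]))

# One outgoing link per document, as in the document map.
document_map = {doc_id: closest_neighbour(doc_id, doc_topics) for doc_id in doc_topics}
```

With this toy data, the two cables dominated by the first topic link to each other, and so do the two dominated by the third. Another similarity measure would fit the same structure.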

We also present the topic map inferred by LDA in a similar (in fact dual) fashion. You can inspect the most frequent words in each topic by hovering over the node that represents it. You can hover over the link between two topics to see the ID of the cable that most strongly supports this link, and you can click on the link to visit this document. The easiest place to hover and click is over the arrowhead.
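One way to read "dual" is to describe each topic by its weight in every document, then link topics the same way documents are linked. The sketch below assumes cosine similarity for the links, and assumes the cable that "most strongly supports" a link is the one with the largest product of the two topic weights. Both choices are illustrative assumptions, not a description of how this site computes its map, and all IDs and numbers are invented.

```python
import math

# Hypothetical topic proportions per cable (rows: documents, columns: topics).
doc_topics = {
    "cable-A": [0.5, 0.4, 0.1],
    "cable-B": [0.6, 0.3, 0.1],
    "cable-C": [0.1, 0.2, 0.7],
    "cable-D": [0.1, 0.1, 0.8],
}
doc_ids = list(doc_topics)
n_topics = 3

# Dual view: each topic becomes a vector of its weights across all documents.
topic_docs = [[doc_topics[d][t] for d in doc_ids] for t in range(n_topics)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def closest_topic(t):
    """Return the other topic whose document profile is most similar to topic t."""
    others = (s for s in range(n_topics) if s != t)
    return max(others, key=lambda s: cosine(topic_docs[t], topic_docs[s]))

def supporting_doc(t, s):
    """Assumed rule: the document where both topics are jointly strongest."""
    return max(doc_ids, key=lambda d: doc_topics[d][t] * doc_topics[d][s])

# For each topic: (linked topic, cable shown when hovering over the link).
topic_map = {}
for t in range(n_topics):
    s = closest_topic(t)
    topic_map[t] = (s, supporting_doc(t, s))
```

Here topic 0 links to topic 1 with cable-A as the supporting document, because cable-A carries substantial weight from both.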

The selected topic is highlighted in red on the topic map, and the documents with substantial contributions from this topic are also highlighted. Click on a topic to select it, or use the selection button.

Background

What is a topic?

We start from a collection of documents (each viewed as a bag of words), and use Latent Dirichlet Allocation (LDA) to model each document as a mixture of a number of topics.

A topic is a probability distribution over words. Once we fix the number of topics, LDA provides a set of topics, together with the proportions in which they should be mixed in each document, that best approximates our collection.
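To make "mixture of topics" concrete, here is a small sketch of the mixing step only: given topics and a document's proportions, it computes the word distribution they imply. It does not perform LDA inference, and the two toy topics, their words, and the proportions are invented for illustration.

```python
# Hypothetical toy topics: each one is a probability distribution over words.
topics = {
    "trade": {"tariff": 0.5, "export": 0.3, "embassy": 0.2},
    "security": {"border": 0.6, "embassy": 0.4},
}

def mix(proportions, topics):
    """Return the word distribution implied by mixing topics in the given proportions."""
    word_probs = {}
    for name, weight in proportions.items():
        for word, p in topics[name].items():
            # A word's probability is the weighted sum of its probability in each topic.
            word_probs[word] = word_probs.get(word, 0.0) + weight * p
    return word_probs

# A document that is three parts "trade" to one part "security".
doc_mixture = {"trade": 0.75, "security": 0.25}
word_probs = mix(doc_mixture, topics)
```

The word "embassy" appears in both topics, so its probability in the document (0.75 × 0.2 + 0.25 × 0.4 = 0.25) draws on both. LDA works in the opposite direction: it starts from the observed words and infers topics and proportions that explain them.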

How can topics help?

Topics correspond to groups of words that tend to occur together across a number of documents. Thus they identify common themes, or topics. Identifying these helps us to explore the collection.

Two documents can be linked by the topics they share. Similarly, two topics can be linked by the documents they have in common.

Is that all?

The site is under development and will change, hopefully rapidly.

Please explore and comment on the blog at http://wikitopics.blogspot.com. This page (the one you're reading now) can be accessed at http://bit.ly/wikitopics.