Topic Browsing

What is this?

We are experimenting with automated topic modelling to generate browsable indexes for large, unstructured collections of documents. We use the Wikileaks Cablegate corpus as a collection of topical interest, alongside other corpora.

You must use an up-to-date WebKit browser (Google Chrome or a recent Safari WebKit nightly build) to use our browsers.

Browsers

Wikileaks Cable Browser

Try the Wikileaks Cable Browser.
Now indexing 6317 cables.

This browser lets you move from topics to documents and back to topics, or from documents to topics and back to documents. Starting from a document of interest, you can easily find related documents; starting from a topic, you can find other related topics.

The aim is to make it tractable to browse this corpus of over 6,000 cables, containing over 6 million words.

We also use the same browser with other corpora: News of the World and OpenBiz StartUp Café.

Topic maps and Document maps

A "proof-of-concept" browser for the Edinburgh Research Archive is built by finding the authors who contribute most to a given topic and linking these authors together to form a social graph where people are drawn together by their common interests.
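
As a rough illustration of how such a graph could be assembled, the sketch below takes the top few authors for each topic and links them pairwise. The author names, the per-author topic weights, the cut-off n and the function names are all hypothetical; the page does not say how author contributions are actually measured.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical per-author topic weights (for example, summed over each
# author's documents). One column per topic.
author_topic_weight = {
    "alice": [0.9, 0.1, 0.0],
    "bob":   [0.7, 0.2, 0.1],
    "carol": [0.1, 0.8, 0.1],
    "dave":  [0.0, 0.6, 0.4],
    "erin":  [0.2, 0.1, 0.7],
}

def top_authors(weights, topic, n):
    """Return the n authors with the largest weight on the given topic."""
    ranked = sorted(weights, key=lambda a: weights[a][topic], reverse=True)
    return ranked[:n]

def author_graph(weights, n_topics, n=2):
    """Link every pair of authors who are both among the top n for a topic."""
    edges = defaultdict(set)  # (author, author) -> set of shared topics
    for topic in range(n_topics):
        for a, b in combinations(sorted(top_authors(weights, topic, n)), 2):
            edges[(a, b)].add(topic)
    return dict(edges)

graph = author_graph(author_topic_weight, n_topics=3, n=2)
for (a, b), shared in sorted(graph.items()):
    print(a, "--", b, "via topics", sorted(shared))
```

Each edge records the topics that connect the two authors, so authors with several common interests end up joined by several topics.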

An earlier experimental browser is still available.

In the document map, we link each document to its closest neighbour. Clicking on a document takes you to that cable on the Wikileaks site.
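
A minimal sketch of the closest-neighbour linking, assuming each document is represented by its topic proportions and that closeness is measured by cosine similarity. Both the similarity measure and the cable names are assumptions made for illustration; the page does not state which measure is used.

```python
import math

# Hypothetical topic proportions for four documents (each row sums to 1).
doc_topics = {
    "cable_a": [0.8, 0.1, 0.1],
    "cable_b": [0.7, 0.2, 0.1],
    "cable_c": [0.1, 0.1, 0.8],
    "cable_d": [0.2, 0.1, 0.7],
}

def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def nearest_neighbours(doc_topics):
    """Map each document to the other document most similar to it."""
    links = {}
    for doc, vec in doc_topics.items():
        others = (d for d in doc_topics if d != doc)
        links[doc] = max(others, key=lambda d: cosine(vec, doc_topics[d]))
    return links

links = nearest_neighbours(doc_topics)
for doc, neighbour in links.items():
    print(doc, "->", neighbour)
```

Every document gets exactly one outgoing link, so the map stays sparse even for thousands of cables.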

We can also present the topic map inferred by LDA in a similar (actually dual) fashion. You can inspect the most frequent words in each topic by hovering over the node that represents it. You can hover over the link between two topics to see the ID of the cable that most strongly supports this link, and you can click on the link to visit this document. The easiest place to hover and click is over the arrowhead.
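
One way to picture the dual construction is sketched below: each topic is linked to the topic it co-occurs with most strongly, and the supporting cable is the one in which both topics are jointly most prominent. The scoring rule (a product of topic proportions), the cable names and the function names are assumptions made for illustration, not a description of the site's actual code.

```python
# Hypothetical topic proportions: one row per cable, one column per topic.
doc_topics = {
    "cable_a": [0.5, 0.5, 0.0],
    "cable_b": [0.6, 0.3, 0.1],
    "cable_c": [0.0, 0.2, 0.8],
    "cable_d": [0.1, 0.0, 0.9],
}
n_topics = 3

def link_strength(i, j):
    """Co-occurrence of topics i and j, summed over all documents."""
    return sum(row[i] * row[j] for row in doc_topics.values())

def supporting_doc(i, j):
    """Document in which topics i and j are jointly most prominent."""
    return max(doc_topics, key=lambda d: doc_topics[d][i] * doc_topics[d][j])

# Link each topic to its strongest partner and record the supporting cable.
topic_links = {}
for i in range(n_topics):
    candidates = (k for k in range(n_topics) if k != i)
    j = max(candidates, key=lambda k: link_strength(i, k))
    topic_links[i] = (j, supporting_doc(i, j))

for topic, (partner, cable) in topic_links.items():
    print(f"topic {topic} -> topic {partner} (supported by {cable})")
```

Here documents play the role that topics play in the document map: they are the shared evidence that ties two topics together.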

The selected topic is highlighted in red on the topic map, and the documents with substantial contributions from this topic are also highlighted. Click on a topic to select it, or use the selection button.

Background

What is a topic?

We start from a collection of documents (each viewed as a bag of words), and use Latent Dirichlet Allocation (LDA) to model each document as a mixture of a number of topics.

A topic is a probability distribution over words. Once we choose a fixed number of topics, LDA provides a set of topics and the proportions in which they should be mixed in each document to best approximate our collection.
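
To make the mixture idea concrete, the sketch below combines two topics into the word distribution of a single document. The topic names, words and probabilities are made up for illustration. It shows only how fitted topics and proportions combine, not how LDA infers them.

```python
# Hypothetical topics: each is a probability distribution over words.
topics = {
    "trade":    {"tariff": 0.5, "export": 0.4, "embassy": 0.1},
    "security": {"border": 0.6, "embassy": 0.3, "export": 0.1},
}

# Hypothetical mixing proportions for one document (they sum to 1).
doc_mixture = {"trade": 0.75, "security": 0.25}

def word_distribution(mixture, topics):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    dist = {}
    for topic, weight in mixture.items():
        for word, p in topics[topic].items():
            dist[word] = dist.get(word, 0.0) + weight * p
    return dist

dist = word_distribution(doc_mixture, topics)
for word, p in sorted(dist.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.3f}")
```

Because each topic sums to 1 and the proportions sum to 1, the resulting word probabilities also sum to 1. A word such as "embassy" that appears in both topics receives weight from each.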

How can topics help?

Topics correspond to groups of words that tend to occur together across a number of documents. Thus they identify common themes, or topics. Identifying these helps us to explore the collection.

Two documents can be linked by the topics they share. Similarly, two topics can be linked by the documents they have in common.

Is that all?

The site is under development and will change, hopefully rapidly.

Please explore and comment on the blog at http://wikitopics.blogspot.com. This page (the one you're reading now) can be accessed as http://bit.ly/wikitopics