Visualising topics as distributions over words

Introduction

The Web is now a medium through which corporations can disseminate and demonstrate their efforts to incorporate sustainability practices into their business processes. This led to the idea of using the Web as a source of data to measure how UK companies are progressing towards meeting the new sustainability requirements recently stipulated by the United Nations. From an initial sample of 100 companies, 563 sustainability-related web pages were identified and collected.

Topic models are a family of algorithms that extract topics from texts. A topic is understood here as a list of words that co-occur in statistically meaningful ways. Topic modelling algorithms do not require any prior annotation or labelling of the documents. Instead, the topics emerge from the analysis of the original texts. Latent Dirichlet Allocation (LDA) is one such topic model. Given a collection of documents, it assigns to each topic a distribution over words and to each document a distribution over topics, in an entirely unsupervised way.
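To make the two kinds of distribution concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, which is one of several possible inference methods and not necessarily the one used for the analysis described here. The function name fit_lda, the toy documents and the hyperparameter values (alpha, beta, the number of iterations) are all assumptions chosen for illustration.

```python
import random

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * V for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start by assigning a random topic to every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # phi[k] is topic k's distribution over words;
    # theta[d] is document d's distribution over topics.
    phi = [
        {vocab[v]: (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
         for v in range(V)}
        for k in range(n_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return phi, theta

# Hypothetical toy corpus standing in for text extracted from web pages.
docs = [
    "carbon emissions energy carbon reduction".split(),
    "charity donation community volunteering charity".split(),
    "energy emissions renewable carbon".split(),
    "community charity employees volunteering".split(),
]
phi, theta = fit_lda(docs, n_topics=2)
```

Each entry of phi sums to 1 over the vocabulary and each entry of theta sums to 1 over the topics, which is what allows a page to be described as covering several areas in given proportions.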

Here we use LDA to identify topics in the text extracted from the scraped web pages. Topics allow us to understand which areas of sustainability each web page covers, such as the environment, supporting charities, employee well-being, etc., and in what proportion.

LDAvis

LDAvis is a web-based interactive visualisation of topics estimated using LDA. It provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic.

The LDAvis package extracts information from a fitted LDA topic model to inform this interactive web-based visualisation.

The visualisation has two basic pieces.

The left panel visualises the topics as circles in a two-dimensional plane. The centres of the circles are determined by computing the Jensen–Shannon divergence between topics and then using multidimensional scaling to project the inter-topic distances onto two dimensions. Each topic’s overall prevalence is encoded by the area of its circle.
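The ingredients of the left panel can be sketched in a few lines. The example below computes pairwise Jensen–Shannon divergences between topic-word distributions and derives circle radii so that area, not radius, is proportional to prevalence. The function names, the three toy topics and the prevalence values are assumptions for illustration; the projection onto two dimensions is left out, and any multidimensional scaling routine that accepts a precomputed distance matrix could take the distances matrix as input.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) using natural logarithms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def pairwise_distances(topics):
    """Symmetric matrix of Jensen-Shannon divergences between all topics."""
    return [[jensen_shannon(p, q) for q in topics] for p in topics]

def circle_radius(prevalence, scale=1.0):
    """Radius of a circle whose area is proportional to topic prevalence."""
    return scale * math.sqrt(prevalence / math.pi)

# Hypothetical topic-word distributions over a shared four-word vocabulary.
topics = [
    [0.50, 0.30, 0.10, 0.10],
    [0.10, 0.10, 0.40, 0.40],
    [0.45, 0.35, 0.10, 0.10],
]
# Hypothetical share of the corpus attributed to each topic.
prevalence = [0.5, 0.3, 0.2]

distances = pairwise_distances(topics)
radii = [circle_radius(p) for p in prevalence]
```

In this toy example the first and third topics have similar word distributions, so their divergence is small and their circles would be placed close together.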

The right panel shows a horizontal bar chart whose bars represent the individual terms that are most useful for interpreting the topic currently selected on the left. For each term, a pair of overlaid bars represents both its corpus-wide frequency and its topic-specific frequency.

The λ slider lets the user rank the terms according to their relevance. By default, the terms of a topic are ranked in decreasing order of their topic-specific probability (λ = 1). Moving the slider towards 0 gives more weight to the ratio of a term's topic-specific probability to its corpus-wide probability, so that terms are ranked by how discriminatory (or "relevant") they are for the specific topic. The suggested “optimal” value of λ is 0.6.
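The ranking behind the slider can be illustrated with a short sketch. Relevance is taken here to be λ times the log of a term's topic-specific probability plus (1 − λ) times the log of the ratio between that probability and the term's corpus-wide probability. The function names and all the probabilities below are made up for illustration; the values do not sum to 1 because the five terms are meant to be a slice of a larger vocabulary.

```python
import math

def relevance(topic_prob, corpus_prob, lam):
    """Weighted mix of log topic probability and log lift for one term."""
    return lam * math.log(topic_prob) + (1 - lam) * math.log(topic_prob / corpus_prob)

def rank_terms(topic_probs, corpus_probs, lam=0.6, top_n=10):
    """Terms of one topic sorted by decreasing relevance for a given lambda."""
    scores = {
        term: relevance(p, corpus_probs[term], lam)
        for term, p in topic_probs.items()
        if p > 0  # the logarithm is undefined for zero-probability terms
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical probabilities of five terms within one topic.
topic_probs = {"the": 0.30, "energy": 0.25, "carbon": 0.20,
               "emissions": 0.15, "report": 0.10}
# Hypothetical probabilities of the same terms across the whole corpus.
corpus_probs = {"the": 0.40, "energy": 0.05, "carbon": 0.03,
                "emissions": 0.02, "report": 0.10}

by_probability = rank_terms(topic_probs, corpus_probs, lam=1.0)
by_lift = rank_terms(topic_probs, corpus_probs, lam=0.0)
balanced = rank_terms(topic_probs, corpus_probs, lam=0.6)
```

With λ = 1 the common word "the" comes first because it is the most probable term in the topic. With λ = 0 it drops to last place, since it is even more frequent in the corpus as a whole than in the topic, and the rarer but more topic-specific "emissions" comes first.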