
@sathomas
Last active March 23, 2021 13:04
Visualizing Bayes, Markov Chain Monte Carlo

Thanks to the excellent bayes.js library from Rasmus Bååth it's now possible to experiment with Bayesian statistics in JavaScript. We'll take advantage of that library in this series of posts, which demonstrate Bayesian statistics visually for anyone with a web browser.

This first post covers Markov Chain Monte Carlo (MCMC), a family of algorithms fundamental to modern Bayesian analysis. MCMC is also critical to many machine learning applications. Since this is the first post, though, we'll start with a brief introduction to Bayesian statistics.

Bayes' Rule in Probability

The simplest form of Bayesian analysis arises in probability, not statistics, in the form of Bayes' Rule (or Bayes' Theorem). Bayes' Rule describes relationships between conditional probabilities. Formally, it says:

Prob(B|A) = Prob(A|B) · Prob(B) / Prob(A)

That equation describes the probability of event B happening when it is known that event A has already happened. More succinctly, it's the probability of "B given A." The rule tells us that the probability of B given A equals the probability of A given B, times the probability of B, divided by the probability of A.

Breast cancer and mammography are often used as an example of Bayes' Rule, and we'll stick with tradition here. Suppose that if a woman has breast cancer, then a mammogram successfully detects that cancer 90% of the time. Now imagine that a particular woman has a routine mammogram that returns a positive result. What is the chance that she has breast cancer?

The answer is definitely not 90%, although many people (including many doctors!) mistakenly think it to be. Bayes' Rule tells us the real answer:

  • Prob(B|A). This is the value we want to find. It's the probability that a woman has breast cancer (event B) given that her mammogram is positive (event A).

  • Prob(A|B). This is the value that we know. It is the probability that a woman with breast cancer (event B) will have a positive mammogram (event A). The value is 0.90.

  • Prob(B). The probability that a woman (any woman) has breast cancer. If one out of every hundred women has breast cancer, Prob(B) = 0.01.

  • Prob(A). The probability that a mammogram returns a positive result whether or not the woman has breast cancer. If the "false alarm rate" of mammography is 9%, then this probability is 0.0981 (0.9 × 0.01 + 0.09 × 0.99).

Evaluating Bayes' Rule with these values, Prob(B|A) = (0.9 × 0.01) / 0.0981 ≈ 0.092, so the chance of the woman having breast cancer is less than 10%.
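The arithmetic is easy to check. The snippet below just evaluates Bayes' Rule with the numbers from the example (all of them illustrative rates, not real clinical data):

```javascript
// Bayes' Rule for the mammogram example, using the rates from the text.
var probAgivenB = 0.90;   // Prob(A|B): positive mammogram given cancer
var probB = 0.01;         // Prob(B): base rate of breast cancer
var falseAlarm = 0.09;    // positive mammogram given no cancer

// Total probability of a positive result, Prob(A)
var probA = probAgivenB * probB + falseAlarm * (1 - probB);

// Bayes' Rule: Prob(B|A) = Prob(A|B) · Prob(B) / Prob(A)
var probBgivenA = (probAgivenB * probB) / probA;

console.log(probA.toFixed(4));       // "0.0981"
console.log(probBgivenA.toFixed(3)); // "0.092" — well under 90%
```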

Bayesian Statistics

Bayesian statistics simply takes Bayes' Rule for probabilities and applies it to statistical inference. Here's an example. Suppose we're handed a coin from a novelty shop. Given its origin, we justifiably doubt that this coin is a "fair coin," equally likely to land heads or tails when flipped. As an experiment, we flip the coin five times and observe that each flip results in heads. What can we infer about the "fairness" of the coin?

The "fairness" of the coin is commonly known as its bias, usually represented as p where p is the probability of a toss resulting in heads. Given the results of our experiment, we'd like to draw some conclusions about p. The first thing we observe is that if we knew the value of p, we could calculate the probability of observing five heads in our five tosses. It's simply p⁵. We can state that more formally as:

Prob(x|p) = p⁵

In that equation, x is the result of our experiment.

Perhaps you've already noticed how Bayes' Rule can help. We know Prob(x|p), but what we want to know is Prob(p|x). We want to swap x and p in the conditional probability, and that's exactly what Bayes' Rule does:

Prob(p|x) = Prob(x|p) · Prob(p) / Prob(x)

Since we know Prob(x|p) we're only missing Prob( p ) and Prob(x).

The Challenge with Bayesian Statistics

Actually, there are two challenges with Bayesian statistics, and they're both captured in the last sentence of the previous section: We need a value for Prob( p ), and we need a value for Prob(x). The problem with the first is philosophy, and the problem with the second is mathematics.

To appreciate the philosophical problem, consider what it means to specify Prob( p ). That's the probability for p without any conditions. We have to supply a probability before we've performed any experiments, despite the fact that the whole reason we want to do experiments is to figure out that probability. At the risk of greatly over-simplifying, the Bayesian answer to this dilemma is "Give it your best guess." To critics of Bayesian statistics (and there are many) this approach introduces subjectivity into the analysis. After all, a guess is intrinsically subjective. Different analysts may well have different guesses. Bayesian proponents counter that all statistics are fundamentally subjective; non-Bayesian analyses (such as traditional null hypothesis significance testing) hide subjectivity by making it implicit (e.g. in the choice of null and alternative hypotheses) rather than explicit. Furthermore, as we'll see, there are steps we can take to reduce subjectivity in real-world problems. Besides, despite its subjectivity, Bayesian statistics certainly works, and that's good enough for us.

The more interesting challenge is supplying a value for Prob(x). That's the probability of our observations (five consecutive heads) without any conditions. The value is easy to describe mathematically.

Prob(x) = ∫ Prob(x|p) · Prob(p) dp

That equation takes a bit of liberty with precise mathematical definitions (instead of Prob(•) we should really use the probability density function f(•)), but it captures the essential idea. Just multiply a couple of functions and calculate the integral of the result. In practice, however, it's not that simple. In many real world applications it turns out that it's not possible to calculate that integral. More precisely, the integral is intractable. If we're going to use Bayesian statistics in a lot of applications, we have to find an alternative to solving for the integral.
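Our coin example happens to be one of the lucky, tractable cases: with a uniform prior, Prob(x) = ∫₀¹ p⁵ · 1 dp = 1/6, which makes the posterior density 6p⁵. A quick numeric sketch confirms the value (the midpoint-rule approximation here is just for illustration):

```javascript
// Numeric check that ∫₀¹ p⁵ · 1 dp = 1/6 for the coin example,
// using a simple midpoint-rule approximation.
var n = 100000, sum = 0;
for (var i = 0; i < n; i++) {
    var p = (i + 0.5) / n;        // midpoint of each sub-interval
    sum += Math.pow(p, 5) / n;    // likelihood × uniform prior × dp
}
console.log(sum);                 // ≈ 0.16666… = 1/6
```

In realistic models with many parameters, this becomes a high-dimensional integral, and grid-style approximations like this one blow up exponentially. That's exactly the gap MCMC fills.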

Markov Chain Monte Carlo to the Rescue

No doubt you've guessed this already, but Markov Chain Monte Carlo (MCMC) is just such an alternative. MCMC has its roots in the Manhattan Project during (and after) World War II. Physicists in that project ran into similar intractable mathematics. In response they developed special algorithms to calculate their integrals using computers. It took a few decades for statisticians to recognize that the same approach could be adapted for Bayesian statistics, but MCMC has since become essential to Bayesian analysis. In fact there are now several special-purpose applications that implement MCMC. The bayes.js library gives us a taste of what those programs can do, and all it needs is a web browser.

We're not going to dig too deeply into the implementation of MCMC; the math required is pretty heavy duty. But we can get a good sense of how MCMC works. Let's start by looking at how MCMC could analyze the bias of our novelty coin. Here are the inputs to MCMC:

  • First we need to tell the algorithm our "best guess" for Prob( p ). That guess, however, isn't a single number. Instead, it's a probability distribution. That distribution indicates how likely we think any particular value of p might be. This is the part of Bayesian statistics that critics disparage as subjective. In fact, though, we don't really have to be subjective at all. By definition the coin's bias must be between 0 and 1, so for our guess we can simply say that any value between 0 and 1 is equally likely. More technically, we say that the prior distribution for p is uniform from 0 to 1.
  • Next we tell the algorithm how to calculate Prob(x|p), also in the form of a probability distribution. It's known as the likelihood. The likelihood is typically defined for any arbitrary experimental results, not just the particular observation we have at hand.
  • Finally we tell the algorithm what data we have actually observed.

To get started, an MCMC algorithm picks a random value for the parameter p. (In some implementations, we can specify a starting value.) In either case, the value is consistent with the prior distribution. For p, that means that the MCMC algorithm will be sure to select a random value between 0 and 1, with any value in that range equally likely.

Now the algorithm is ready to go. It repeats the following steps for as long as we ask it to run, typically tens or hundreds of thousands of times:

  1. Randomly pick a new value for the parameter p that's still consistent with the prior distribution, yet somewhat different than the current value. (Call the new value p₂ and the old value p₁.)
  2. Compare the probability of the new parameter value, Prob(p₂|x), with the probability of the old parameter value, Prob(p₁|x). (More on this in a minute.)
  3. The details depend on the specific MCMC algorithm, but in general, if the new value is more likely than the old value, then the algorithm tends to replace the old with the new. If, on the other hand, the new value is less likely than the old value, the algorithm tends to keep the old value.
  4. Repeat.

At first glance, step 2 may seem a little troubling. How can we compare two values for Prob(p|x)? We've already noted that calculating that value often involves really gnarly integrals that are, for all practical purposes, intractable. Here's the trick, though: We don't have to actually calculate the values to compare them. Consider the question we're asking:

Is Prob(p₂|x) > Prob(p₁|x) ?

Substitute the expression for Prob(p|x) in that comparison and it becomes:

Is Prob(x|p₂) · Prob(p₂) / Prob(x) > Prob(x|p₁) · Prob(p₁) / Prob(x) ?

Notice that the gnarly part of the expression, Prob(x), appears on both sides of the comparison. It cancels out, therefore, and our comparison is simply:

Is Prob(x|p₂) · Prob(p₂) > Prob(x|p₁) · Prob(p₁) ?

With the nasty Prob(x) eliminated, we're left with values that we can compute. For both values of p, we're simply multiplying the prior by the likelihood.

Now here's the neat thing about MCMC. Once the algorithm has gotten a good head of steam (generally by the first few hundred or first few thousand passes), the values that it selects for p will mirror the parameter's true probability distribution.

To understand p, all we have to do is let the MCMC algorithm run for a few hundred passes (called burn-in) and then record the values of p it picks for the next few thousand passes. It may not be as satisfying as having a simple equation for the distribution of p, but it's exactly the information we want.
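The whole procedure can be sketched in a few lines for the coin example. This is a toy Metropolis sampler (one classic MCMC algorithm; bayes.js itself uses a more sophisticated adaptive variant), with all the function names invented for illustration:

```javascript
// Toy Metropolis sampler for the coin's bias p after 5 heads in 5 tosses.
function priorDensity(p) {            // uniform prior on [0, 1]
    return (p >= 0 && p <= 1) ? 1 : 0;
}
function likelihood(p) {              // Prob(x|p) = p⁵
    return Math.pow(p, 5);
}
function sample(iterations) {
    var p = Math.random();            // random start, consistent with the prior
    var samples = [];
    for (var i = 0; i < iterations; i++) {
        // 1. Propose a nearby value
        var p2 = p + 0.2 * (Math.random() - 0.5);
        // 2–3. Prob(x) cancels, so only prior × likelihood is compared;
        // accept the proposal with probability min(1, posterior ratio)
        var ratio = (priorDensity(p2) * likelihood(p2)) /
                    (priorDensity(p) * likelihood(p));
        if (Math.random() < ratio) { p = p2; }
        samples.push(p);
    }
    return samples;
}
var draws = sample(20000).slice(1000);  // discard 1,000 samples as burn-in
var mean = draws.reduce(function(a, b) { return a + b; }) / draws.length;
console.log(mean);  // close to 6/7 ≈ 0.857, the exact posterior mean of 6p⁵
```

The retained draws approximate the posterior distribution of p, so their average approximates its mean, and a histogram of them approximates its shape.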

Visualizing MCMC

We can see this play out in the visualization that introduces this post. And to make things more interesting we've graduated from one dimension to two. Here's the setup: we have ten observations that we believe have come from a Normal distribution, but we don't know the mean or the standard deviation of that distribution. MCMC will give us estimates for both parameters.

Just so you know, I cheated a bit. In fact, I do know the parameters for the distribution from which the observations came. The mean was 100 and the standard deviation was 10. Let's keep those values secret from the MCMC algorithm, however, and see how close it comes in its estimates.

Most of the code in the example is standard D3.js boilerplate for a scatterplot. The more interesting bits are toward the end. That's where we set up and run the MCMC algorithm. First we specify the parameters for our model. Those are the unknown mean and standard deviation for the population from which our samples come. Both are real numbers, and the standard deviation must be positive.

var params = {
    mu: {type: "real"},
    sigma: {type: "real", lower: 0}
};

Then we specify the model itself. The model includes both the prior distributions (of both parameters) and the likelihood. Consider the prior distributions first. In this example, we have two parameters, μ and σ. In the coin example there was only one parameter, p, and we accounted for it as Prob( p ). With two parameters, we have to use Prob(μ and σ). Fortunately, we're assuming that those two parameters are independent of each other, so that's just the product of their individual probabilities.

Prob(μ and σ) = Prob(μ) · Prob(σ)

Now consider the likelihood. Here again we have more than one value. Since we have ten observations, there are ten separate likelihoods. Fortunately, independence again helps. To find the likelihood of all ten observations we simply multiply them together.

Prob(x|μ,σ) = Prob(x₁|μ,σ) · Prob(x₂|μ,σ) · Prob(x₃|μ,σ) · … · Prob(x₁₀|μ,σ)

There's one last trick to calculating the value that the MCMC algorithm needs for its comparison. Instead of multiplying all these values together (and risking numeric underflow or rounding errors), we can sum their logarithms. Because the logarithm is a monotonically increasing function, comparing logarithms has the same effect as comparing the original values.
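The log trick is more than a convenience. A product of many small densities quickly underflows double-precision floating point, while the equivalent sum of logs stays comfortably representable. A toy illustration (the numbers are arbitrary):

```javascript
// A product of 200 small densities underflows to 0 in floating point,
// but the sum of their logarithms remains representable and comparable.
var densities = [];
for (var i = 0; i < 200; i++) { densities.push(0.01); }

var product = densities.reduce(function(a, b) { return a * b; }, 1);
var logSum  = densities.reduce(function(a, b) { return a + Math.log(b); }, 0);

console.log(product);  // 0 — underflow; comparisons become meaningless
console.log(logSum);   // ≈ -921.03, still easy to compare
```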

Our model function, in full, simply returns the value

log[ Prob(x|μ,σ) · Prob(μ) · Prob(σ) ]

Here's the code in full. The bayes.js library will call this function each time it generates a new value for the parameters, and it will use the returned value for its comparisons.

var model = function(state, obs) {
    var log_post = 0;
    // Priors
    log_post += ld.norm(state.mu, 0, 100);
    log_post += ld.unif(state.sigma, 0, 100);
    // Likelihood
    for(var i = 0; i < obs.length; i++) {
        log_post += ld.norm(obs[i], state.mu, state.sigma);
    }
    return log_post;
};

The rest of the code is pretty straightforward. When the user clicks on the chart, the visualization starts executing. It runs the MCMC algorithm one step at a time and plots the resulting parameter values on the chart. The first 400 values, which we'll discard as burn-in, are plotted in green, and the final 600 values are orange. When all the iterations are complete, the code shows the mean values for both μ and σ. If things are working correctly, those values should be close to the population parameters of 100 and 10.

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>bayes.js Demo</title>
<script src="https://rawgit.com/rasmusab/bayes.js/master/mcmc.js"></script>
<script src="https://rawgit.com/rasmusab/bayes.js/master/distributions.js"></script>
<script src="https://d3js.org/d3.v3.min.js"></script>
<link href='http://fonts.googleapis.com/css?family=Varela'
rel='stylesheet' type='text/css'>
<style>
body {
color: #444;
font-family: Varela,sans-serif;
}
.axis path, .axis line {
fill: none;
stroke: #888;
shape-rendering: crispEdges;
}
.target {
fill: #ca0000;
}
</style>
</head>
<body>
<div id="graph"></div>
<script>
// Standard D3.js scatterplot set up
var margin = {top: 40, right: 40, bottom: 40, left: 40},
width = 636 - margin.left - margin.right,
height = 476 - margin.top - margin.bottom;
var svg = d3.select("#graph").append("svg")
.attr("height", height + margin.top + margin.bottom)
.attr("width", width + margin.left + margin.right);
var chart = svg.append("g")
.attr("transform",
"translate(" + margin.left + "," + margin.top + ")"
);
var xScale = d3.scale.linear()
.range([0,width]);
var yScale = d3.scale.linear()
.range([height,0]);
// Since we know that the actual mean and standard
// deviation are 100 and 10, respectively, we can set
// the domains appropriately.
xScale.domain([0, 200]);
yScale.domain([0, 20]);
var xAxis = d3.svg.axis()
.scale(xScale)
.orient("bottom");
var yAxis = d3.svg.axis()
.scale(yScale)
.orient("left");
chart.append("g")
.attr("class", "axis")
.attr("transform", "translate(0," + height + ")")
.call(xAxis)
.append("text")
.attr("x", width)
.attr("y", -10)
.style("text-anchor", "end")
.text("μ");
chart.append("g")
.attr("class", "axis")
.call(yAxis)
.append("text")
.attr("x", 15)
.attr("y", 6)
.style("text-anchor", "end")
.text("σ");
// Add a special target that the sampler should
// converge around.
chart.append("circle")
.attr("class", "target")
.attr("r", 6)
.attr("cx", xScale(100))
.attr("cy", yScale(10));
chart.append("text")
.attr("x", width)
.attr("y", 20)
.classed("notes", true)
.style("text-anchor", "end")
.text("Click to start");
function addPoint(d, prev) {
if (prev) {
prev.transition().attr('fill-opacity', '0.1');
}
return chart.append("circle")
.attr('fill-opacity', '1')
.attr("r", 4)
.attr("cx", xScale(d.mu[0]))
.attr("cy", yScale(d.sigma[0]));
}
// Set up the Bayesian analysis by defining the
// parameters, the model, and the observations.
// Create an MCMC sampler to evaluate the model.
// We want to model the observations as coming
// from a Normal distribution with mean `mu`
// and standard deviation `sigma`. So those
// are our parameters. Both are real, and, of
// course, we need to ensure that `sigma` is
// positive.
var params = {
mu: {type: "real"},
sigma: {type: "real", lower: 0}
};
// For the model itself, we first specify the
// prior probabilities for the parameters. We
// don't want to assume much about those parameters,
// so we set the priors as follows:
//
// - `mu`: normal with mean of 0 and a huge
// standard deviation. Effectively,
// `mu` could be anything.
// - `sigma`: uniform from 0 to 100. We're
// saying that `sigma` could be
// practically anything also, as
// long as it's positive.
//
// After specifying the prior probabilities,
// we specify the likelihood. In this case,
// we're simply modeling the observations as
// data from a Normal distribution with a
// mean of `mu` (whatever that value turns
// out to be) and a standard deviation of
// `sigma` (whatever it turns out to be).
var model = function(state, obs) {
var log_post = 0;
// Priors
log_post += ld.norm(state.mu, 0, 100);
log_post += ld.unif(state.sigma, 0, 100);
// Likelihood
for(var i = 0; i < obs.length; i++) {
log_post += ld.norm(obs[i], state.mu, state.sigma);
}
return log_post;
};
// Here are the "observations." Don't tell
// the sampler, but they were drawn from a
// normal distribution with mean 100 and
// standard deviation of 10. We'll see how
// close the sampler can get to those values.
var obs = [
120.6, 103.3, 108.1, 103.7, 96.35,
102.1, 96.23, 106.9, 111.2, 100.2
];
var sampler = new mcmc.AmwgSampler(params, model, obs);
// Define the total number of samples
// that we'll generate, as well as how
// many of them we'll use as burn-in.
var numSamples = 1000,
burnin = 400;
// Wait for a click to start the simulation
d3.select('body').on('click', function() {
d3.select(".notes").text("");
var pts = 1,
d = sampler.sample(1),
mu = 0,
sigma = 0,
start = addPoint(d),
prev;
// Mark the starting point
start.attr('r', '6')
.attr('fill', '#007979');
var interval = setInterval(function () {
d = sampler.sample(1);
prev = addPoint(d, prev);
if (pts > burnin) {
prev.attr('fill', '#ca5c00');
mu += d.mu[0];
sigma += d.sigma[0];
} else {
prev.attr('fill', '#7EBD00');
}
if (pts++ > numSamples) {
clearInterval(interval);
var dist = "N(μ = " +
Math.round(10*mu/(pts-burnin))/10 + ", σ = " +
Math.round(10*sigma/(pts-burnin))/10 + ")";
d3.select(".notes").text(dist);
}
}, 100);
});
</script>
</body>
</html>