Skip to content

Instantly share code, notes, and snippets.

@josiahdavis
Last active January 13, 2017 21:54
Show Gist options
  • Save josiahdavis/f5b8944dd2288d4400beb5a40084edbf to your computer and use it in GitHub Desktop.
Save josiahdavis/f5b8944dd2288d4400beb5a40084edbf to your computer and use it in GitHub Desktop.
Simple bootstrapping example
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bootstrapping\n",
"\n",
"1. What is bootstraping?\n",
"2. An example in Python\n",
"3. Why bootstrap?\n",
"4. Confidence Intervals\n",
"5. References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is bootstrapping?\n",
"\n",
"Bootstrapping is repeated resampling with replacement, often for the purpose of estimating the standard error of a statistic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** Repeated resampling with replacement **\n",
"\n",
"Imagine that you have a fruit basket with four fruit in it (your sample). You take out a fruit, record the value of that fruit, and put it back in the basket. Then you take out another fruit, record what type it is, and put it back in the basket."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"np.random.seed(100)\n",
"sample = np.array(['apple', 'orange', 'pear', 'bananna'])\n",
"resample1 = np.random.choice(sample, size = 4, replace = True)\n",
"resample2 = np.random.choice(sample, size = 4, replace = True)\n",
"resample3 = np.random.choice(sample, size = 4, replace = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"resample1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"resample2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"resample3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** Statistic **\n",
"\n",
"What is the difference between a statistic and a parameter? A statistic is a property of the sample, while a parameter is a property of the distribution. Here a few examples that you will recognize:\n",
"\n",
"- $\\hat{\\theta}=min(X)$\n",
"- $\\hat{\\theta}=\\bar{X}$\n",
"- $\\hat{\\theta}=median(X)$\n",
"- $\\hat{\\theta}=\\hat{\\beta_1}$ (Linear regression coefficient)\n",
"- $\\hat{\\theta}=R^2$ (Coefficient of determination)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sample = np.array([1, 2, 3, 4, 5, 6, 12])\n",
"print('Min:', np.min(sample))\n",
"print('Max:', np.max(sample))\n",
"print('Mean:', np.mean(sample))\n",
"print('Variance:', np.var(sample))\n",
"print('Median:', np.median(sample))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a diagram that shows how these two concepts work togther.\n",
"![title](bootstrapping.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Estimating the Standard Error**\n",
"> Standard error is a general term for the standard deviation of a summary statistic. They are the most common way of determining statistical accuracy.\n",
"\n",
"Efron and Tibshirani, An Introduction to the Bootstrap\n",
"\n",
"We need to estimate the standard error of a statistic, because we usually won't know the parameters of the distribution that the statistic is summarized from."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** Putting it all together **\n",
"\n",
"Here is a simple three-step algorithm:\n",
"1. Draw B bootstrapped samples from the original sample\n",
"2. Calculate the statistic for each of the bootstrapped samples\n",
"3. Estimate the standard error by calculating the standard deviation of the statistics\n",
"\n",
"This algorithm is adapted from p. 47 of An Introduction to the Bootstrap"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## An example in Python\n",
"\n",
"Let's drive this home with a simple example in python. In this example we are going to estimate the standard error of the mean statistic. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"np.random.seed(100)\n",
"sample = np.random.normal(loc = 10, scale = 3, size = 15)\n",
"sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Estimate of the statistic\n",
"thetaHat = np.mean(sample)\n",
"thetaHat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"n = sample.shape[0]\n",
"B = 1000\n",
"statBoot = np.zeros(B)\n",
"\n",
"for b in range(B):\n",
" resample = np.random.choice(sample, size = n, replace = True)\n",
" statBoot[b] = np.mean(resample)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's inspect the result through visualization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"plt.style.use('ggplot')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def plot_hist(array, ax):\n",
" '''Plot a histogram of bootstrapped values'''\n",
" kwargs = dict(bins=30, normed=True, color='steelblue', edgecolor='none')\n",
" ax.hist(array, **kwargs)\n",
" ax.set(title='Bootstrapping', ylabel='Density', xlabel = 'Statistic')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1)\n",
"plot_hist(statBoot, ax)\n",
"ax.axvline(x = thetaHat, linestyle = '--', color = 'black')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Estimate of the standard error\n",
"np.sqrt(np.var(statBoot))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why bootstrap?\n",
"There is a closed form equation for estimating the standard error of the sample mean is easily derived in mathmatical closed form for the mean: $$\\hat{se}(\\bar{X})=\\frac{s}{\\sqrt(n)}$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Closed-form estimate of the standard error\n",
"np.sqrt(np.var(sample) / n) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So why bother with the bootstrap? \n",
"\n",
"The answer is that the standard error for many statistics is not so easy to derive mathmatically. For example, it is very difficult to mathmatically derive the standard error for the sample median."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculating Confidence Intervals\n",
"One of the main applications of the bootstrap is to calculate confidence intervals, which can be used to quantify a range of possible values for your parameter. \n",
"\n",
"As a reminder, the way to interprete a 95% confidence interval is that 95% of the time, the confidence interval will trap the true parameter.\n",
"\n",
"![title](confidence_intervals.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** Calculating Normal Intervals **\n",
"\n",
"The easiest way to calculate a confidence interval is to assume that your sampling statistic follows a normal distribution: \n",
"$$\\hat{\\theta} \\pm 1.96\\hat{se}$$\n",
"\n",
"Where z = 1.96 is the z value associated with a 95% confidence interval and $\\hat{se}$ is the estimated standard error from the bootstrapping. \n",
"\n",
"In our toy example, this is how we would compute the confidence interval:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"thetaHat = np.mean(sample)\n",
"seHat = np.sqrt(np.var(statBoot))\n",
"[thetaHat - 1.96 * seHat, thetaHat + 1.96 * seHat]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, this method does not always work. Let's examine a more complicated example where this normal method breaks down.\n",
"\n",
"In this example the statistic we are estimating is the correlation between average LSAT scores and the GPA for law schools. In this example, the population is 82 law schools, and our sample is 15 of those law schools."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"law = pd.read_csv('law_sample.csv')\n",
"law"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sample = law.values\n",
"n = sample.shape[0]\n",
"B = 1000\n",
"statBoot = np.zeros(B)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def calc_corr(array):\n",
" '''Calculate pearsons correlation coefficient for 1st and \n",
" 2nd column in an array'''\n",
" return(np.corrcoef(array[:,1], array[:,2])[0, 1])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"np.random.seed(100)\n",
"for b in range(B):\n",
" idx = np.random.choice(range(n), size = n, replace = True)\n",
" resample = sample[idx,:]\n",
" statBoot[b] = calc_corr(resample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"statBoot[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's calculate the confidence interval as before, using the normal method"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"thetaHat = calc_corr(sample)\n",
"seHat = np.sqrt(np.var(statBoot))\n",
"print(thetaHat)\n",
"normalCI = [thetaHat - 1.96 * seHat, thetaHat + 1.96 * seHat]\n",
"normalCI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we come to our first of two issues. The correlation coefficient is only defined on the range [-1, 1], yet the normal confidence interval does not preserve this range property.\n",
"\n",
"To see the second issue, let's take a look at the distribution of the statistic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1)\n",
"plot_hist(statBoot, ax)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the distribution is not normally shaped. So it doesn't really make sense to use the normal confidence intervals here for that reason either.\n",
"\n",
"**Percentile Confidence Intervals**\n",
"\n",
"A more robust method for calculating confidence intervals using the bootstrap is to use the percentiles of the bootstrapped distribution:\n",
"- Percentile method respects the range of the statistic\n",
"- Does not assume that the distribution of the statistic is normally distributed\n",
"\n",
"We can calculate the percentiles 2.5%, for example by identifying the value that 2.5% of the sampled statistics are less than."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"percentileCI = np.percentile(statBoot, [2.5, 97.5])\n",
"percentileCI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1)\n",
"plot_hist(statBoot, ax)\n",
"ax.axvline(x = thetaHat, color = 'black')\n",
"ax.axvline(x = percentileCI[0], linestyle = '--', color = 'black')\n",
"ax.axvline(x = percentileCI[1], linestyle = '--', color = 'black')"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## References\n",
"1. **Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman and Hall, New York, London.** The definitive guide to Bootstrapping written by the inventor of the Bootstrap and one of his most prominent graduate students ([Amazon link](https://www.amazon.com/Introduction-Bootstrap-Monographs-Statistics-Probability/dp/0412042312/ref=sr_1_1?ie=UTF8&qid=1484328384&sr=8-1&keywords=an+introduction+to+the+bootstrap)).\n",
"\n",
"2. **Freedman, D., Pisani, R. and Purves, R. (2007) Statistics. W. W. Norton & Company, New\n",
"York, London.** Widely considered to be \"The Bible\" for teaching college statistics, this is a great book for anyone who didn't get the chance to take Statistics as an undergraduate level, or who simply wants a refresher on basic concepts ([Amazon link](https://www.amazon.com/Statistics-4th-David-Freedman/dp/0393929728/ref=sr_1_3?ie=UTF8&qid=1484328391&sr=8-3&keywords=statistics))."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
LSAT GPA
576 3.39
635 3.3
558 2.81
578 3.03
666 3.44
580 3.07
555 3
661 3.43
651 3.36
605 3.13
653 3.12
575 2.74
545 2.76
572 2.88
594 2.96
School LSAT GPA
1 622 3.23
2 542 2.83
3 579 3.24
4 653 3.12
5 606 3.09
6 576 3.39
7 620 3.1
8 615 3.4
9 553 2.97
10 607 2.91
11 558 3.11
12 596 3.24
13 635 3.3
14 581 3.22
15 661 3.43
16 547 2.91
17 599 3.23
18 646 3.47
19 622 3.15
20 611 3.33
21 546 2.99
22 614 3.19
23 628 3.03
24 575 3.01
25 662 3.39
26 627 3.41
27 608 3.04
28 632 3.29
29 587 3.16
30 581 3.17
31 605 3.13
32 704 3.36
33 477 2.57
34 591 3.02
35 578 3.03
36 572 2.88
37 615 3.37
38 606 3.2
39 603 3.23
40 535 2.98
41 595 3.11
42 575 2.92
43 573 2.85
44 644 3.38
45 545 2.76
46 645 3.27
47 651 3.36
48 562 3.19
49 609 3.17
50 555 3
51 586 3.11
52 580 3.07
53 594 2.96
54 594 3.05
55 560 2.93
56 641 3.28
57 512 3.01
58 631 3.21
59 597 3.32
60 621 3.24
61 617 3.03
62 637 3.33
62 572 3.08
64 610 3.13
65 562 3.01
66 635 3.3
67 614 3.15
68 546 2.82
69 598 3.2
70 666 3.44
71 570 3.01
72 570 2.92
73 605 3.45
74 565 3.15
75 686 3.5
76 608 3.16
77 595 3.19
78 590 3.15
79 558 2.81
80 611 3.16
81 564 3.02
82 575 2.74
1 622 3.23
2 542 2.83
3 579 3.24
4 653 3.12
5 606 3.09
6 576 3.39
7 620 3.1
8 615 3.4
9 553 2.97
10 607 2.91
11 558 3.11
12 596 3.24
13 635 3.3
14 581 3.22
15 661 3.43
16 547 2.91
17 599 3.23
18 646 3.47
19 622 3.15
20 611 3.33
21 546 2.99
22 614 3.19
23 628 3.03
24 575 3.01
25 662 3.39
26 627 3.41
27 608 3.04
28 632 3.29
29 587 3.16
30 581 3.17
31 605 3.13
32 704 3.36
33 477 2.57
34 591 3.02
35 578 3.03
36 572 2.88
37 615 3.37
38 606 3.2
39 603 3.23
40 535 2.98
41 595 3.11
42 575 2.92
43 573 2.85
44 644 3.38
45 545 2.76
46 645 3.27
47 651 3.36
48 562 3.19
49 609 3.17
50 555 3
51 586 3.11
52 580 3.07
53 594 2.96
54 594 3.05
55 560 2.93
56 641 3.28
57 512 3.01
58 631 3.21
59 597 3.32
60 621 3.24
61 617 3.03
62 637 3.33
62 572 3.08
64 610 3.13
65 562 3.01
66 635 3.3
67 614 3.15
68 546 2.82
69 598 3.2
70 666 3.44
71 570 3.01
72 570 2.92
73 605 3.45
74 565 3.15
75 686 3.5
76 608 3.16
77 595 3.19
78 590 3.15
79 558 2.81
80 611 3.16
81 564 3.02
82 575 2.74
School LSAT GPA
4 653 3.12
6 576 3.39
13 635 3.3
15 661 3.43
31 605 3.13
35 578 3.03
36 572 2.88
45 545 2.76
47 651 3.36
50 555 3
52 580 3.07
53 594 2.96
70 666 3.44
79 558 2.81
82 575 2.74
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment