
Marcos Castro de Souza marcoscastro

marcoscastro / README.md
Created December 3, 2015 21:04 — forked from rpgove/README.md
Using the elbow method to determine the optimal number of clusters for k-means clustering

K-means is a simple unsupervised machine learning algorithm that groups a dataset into a user-specified number (k) of clusters. The algorithm is somewhat naive: it partitions the data into k clusters even if k is not the right number of clusters to use. Users of k-means clustering therefore need some way to check whether they have chosen the right number of clusters.

One method to validate the number of clusters is the elbow method. The idea is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10) and, for each value of k, calculate the sum of squared errors (SSE), that is, the sum of the squared distances between each point and the mean of its cluster. For example:

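var sse = {};
for (var k = 1; k <= maxK; ++k) {
    sse[k] = 0;
    // Cluster the dataset into k groups.
    var clusters = kmeans(dataset, k);
    clusters.forEach(function(cluster) {
        var mean = clusterMean(cluster);
        // Add each point's squared distance from its cluster mean
        // (this subtraction assumes one-dimensional, numeric data points).
        cluster.forEach(function(datapoint) {
            sse[k] += Math.pow(datapoint - mean, 2);
        });
    });
}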
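
To make the SSE loop concrete, here is a minimal, self-contained Python sketch of the same idea. It is not the gist author's implementation: the `kmeans`, `sse`, and `elbow_curve` functions, the fixed random seed, and the tiny one-dimensional `dataset` are all assumptions made for illustration. The k-means step is a plain Lloyd's iteration with a single random initialization, so it can settle in a local optimum. A more careful version would probably use several restarts.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Hypothetical helper: Lloyd's algorithm on 1-D points.

    Returns a list of k clusters, each a list of points.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return clusters


def sse(clusters):
    """Sum of squared distances between each point and its cluster mean."""
    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster contributes nothing
        mean = sum(cluster) / len(cluster)
        total += sum((p - mean) ** 2 for p in cluster)
    return total


def elbow_curve(points, max_k):
    """Map each k in 1..max_k to the SSE of a k-means run with that k."""
    return {k: sse(kmeans(points, k)) for k in range(1, max_k + 1)}


# Made-up data: three loosely separated groups on a line.
dataset = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
curve = elbow_curve(dataset, 5)
for k, value in curve.items():
    print(k, round(value, 3))
```

Plotting or printing `curve` gives the SSE for each k. The value of k after which the SSE stops dropping sharply is the "elbow" candidate.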

marcoscastro / kmeans.py
Created November 19, 2015 14:16 — forked from LiorZ/kmeans.py
KMeans clustering python script for biological sequences
#!/usr/bin/python
### Created by Lior Zimmerman (http://www.github.com/LiorZ) ###
### Distributed under MIT License (http://opensource.org/licenses/MIT) ###
import sys, getopt
from Bio import SeqIO,pairwise2
import Bio.SubsMat.MatrixInfo as matrices
import sklearn.cluster as cluster
marcoscastro / gist:aebcc78538ec36f4229e
Created November 13, 2015 19:37 — forked from mejibyte/gist:1268157
Implementation of Ukkonen's algorithm to build a suffix tree in O(n) time
using namespace std;
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <fstream>
#include <cassert>
#include <climits>
#include <cstdlib>
#include <cstring>