Marcos Castro de Souza marcoscastro

K-means is a simple unsupervised machine learning algorithm that groups a dataset into a user-specified number (k) of clusters. The algorithm is somewhat naive: it partitions the data into k clusters even if k is not the right number of clusters to use. Therefore, when using k-means clustering, users need some way to determine whether they are using the right number of clusters.

One method to validate the number of clusters is the elbow method. The idea is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10) and, for each value of k, calculate the sum of squared errors (SSE): the sum, over all data points, of the squared distance between each point and the mean of its cluster. For example:

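var sse = {};
for (var k = 1; k <= maxK; ++k) {
    sse[k] = 0;
    // kmeans() is assumed to return an array of clusters, each an array of numeric data points.
    var clusters = kmeans(dataset, k);
    clusters.forEach(function(cluster) {
        var mean = clusterMean(cluster);
        // Add each point's squared distance from its cluster mean.
        cluster.forEach(function(datapoint) {
            sse[k] += Math.pow(datapoint - mean, 2);
        });
    });
}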
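
The SSE loop relies on `kmeans` and `clusterMean`, which are not defined here. As a minimal, self-contained sketch of the same idea, the block below assumes one-dimensional numeric data and uses a small Lloyd's-style k-means with deterministic initialization. The names `kmeans1d`, `clusterSSE`, and `elbowCurve` are hypothetical helpers introduced only for illustration, not part of any library.

```javascript
// Minimal 1-D k-means (Lloyd's algorithm) with deterministic initialization.
// Hypothetical helper for illustration; assumes numeric, one-dimensional data.
function kmeans1d(data, k, maxIter = 100) {
    const sorted = [...data].sort((a, b) => a - b);
    // Spread the initial centroids evenly across the sorted data.
    let centroids = [];
    for (let i = 0; i < k; ++i) {
        centroids.push(sorted[Math.floor(((i + 0.5) * sorted.length) / k)]);
    }
    let clusters = [];
    for (let iter = 0; iter < maxIter; ++iter) {
        // Assignment step: each point joins its nearest centroid.
        clusters = centroids.map(() => []);
        for (const x of sorted) {
            let best = 0;
            for (let c = 1; c < k; ++c) {
                if (Math.abs(x - centroids[c]) < Math.abs(x - centroids[best])) best = c;
            }
            clusters[best].push(x);
        }
        // Update step: move each centroid to the mean of its cluster.
        const updated = clusters.map((cl, c) =>
            cl.length ? cl.reduce((s, x) => s + x, 0) / cl.length : centroids[c]);
        const converged = updated.every((m, c) => m === centroids[c]);
        centroids = updated;
        if (converged) break;
    }
    // Drop any cluster that ended up empty.
    return clusters.filter((cl) => cl.length > 0);
}

// Sum of squared distances between each point and its cluster mean.
function clusterSSE(cluster) {
    const mean = cluster.reduce((s, x) => s + x, 0) / cluster.length;
    return cluster.reduce((s, x) => s + (x - mean) ** 2, 0);
}

// Total SSE for every k from 1 to maxK.
function elbowCurve(data, maxK) {
    const sse = {};
    for (let k = 1; k <= maxK; ++k) {
        sse[k] = kmeans1d(data, k).reduce((s, cl) => s + clusterSSE(cl), 0);
    }
    return sse;
}

// Three well-separated groups, so the curve should flatten after k = 3.
const dataset = [1, 2, 3, 11, 12, 13, 21, 22, 23];
const sse = elbowCurve(dataset, 5);
console.log(sse);
```

On this toy dataset the SSE drops from 606 at k = 1 to 6 at k = 3 and barely improves afterwards, which is the "elbow" the method looks for. Real data is rarely this clean, and a production k-means would typically use random restarts rather than the fixed initialization assumed here.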

	#!/usr/bin/python

	### Created by Lior Zimmerman (http://www.github.com/LiorZ) ###
	### Distributed under MIT License (http://opensource.org/licenses/MIT) ###

	import sys, getopt
	from Bio import SeqIO,pairwise2
	import Bio.SubsMat.MatrixInfo as matrices
	import sklearn.cluster as cluster

	#include <algorithm>
	#include <iostream>
	#include <iterator>
	#include <sstream>
	#include <fstream>
	#include <cassert>
	#include <climits>
	#include <cstdlib>
	#include <cstring>
	using namespace std;