K-means clustering is a machine learning algorithm that divides data into clusters. Here, K is the number of clusters into which we want to divide the data. In this edition, we at Oodles, an experiential Artificial Intelligence Development Company, provide a step-by-step guide to implementing K-means clustering in Node.js for large unlabeled datasets.
Picture Source: en.wikipedia.org/wiki/K-means_clustering
K-means clustering is an effective machine learning algorithm for segmenting large unlabeled datasets, which makes it a useful first step before applying predictive analytics services.
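To make the idea concrete, the loop at the heart of K-means can be sketched in plain JavaScript for one-dimensional data. This is only a minimal illustration under stated assumptions, not the code of any particular library: the helper name kMeans1D is hypothetical, and the starting centroids are picked by hand, whereas real implementations usually choose them randomly or with k-means++.

```javascript
// Minimal 1-D k-means (Lloyd's algorithm). Illustrative only:
// kMeans1D is a hypothetical helper, not part of any npm package.
function kMeans1D(points, initialCentroids, maxIterations = 100) {
  let centroids = initialCentroids.slice();
  let labels = new Array(points.length).fill(-1);

  for (let iter = 0; iter < maxIterations; iter++) {
    // Assignment step: label each point with the index of its nearest centroid.
    const newLabels = points.map((p) => {
      let best = 0;
      for (let c = 1; c < centroids.length; c++) {
        if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
      }
      return best;
    });

    // Stop once no point changes cluster.
    const changed = newLabels.some((label, i) => label !== labels[i]);
    labels = newLabels;
    if (!changed) break;

    // Update step: move each centroid to the mean of its assigned points.
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      if (members.length === 0) return old; // an empty cluster keeps its centroid
      return members.reduce((sum, p) => sum + p, 0) / members.length;
    });
  }
  return { centroids, labels };
}

const sketchData = [0, 1, 2, 3, 4, 1000, 1001, 1002, 1003, 1004, 2000, 2001, 2002, 2003, 2004];
// Hand-picked starting centroids (an assumption made for this sketch).
const sketchResult = kMeans1D(sketchData, [0, 1000, 2000]);
console.log(sketchResult.centroids); // [ 2, 1002, 2002 ]
console.log(sketchResult.labels);
```

Each pass alternates between assigning points to the nearest centroid and recomputing the centroids, and the loop ends when the assignments stop changing.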
Let's understand it better with the help of an example.
First, we require the skmeans npm package, which we will use to implement the K-means clustering algorithm.
const skmeans = require("skmeans");
Let's define our input:
var input_data = [0, 1, 2, 3, 4, 1000, 1001, 1002, 1003, 1004, 2000, 2001, 2002, 2003, 2004];
Let's create a model using this input data:
var skmeansModel = skmeans(input_data,3, 'kmpp', 1000000);
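Note that, as called above, skmeans accepts the following parameters: the input data, the number of clusters K, the centroid initialization method ('kmpp' selects k-means++ initialization), and the maximum number of iterations.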
Let's test out the model:
// Cluster indices are arbitrary labels and may differ between runs.
console.log(skmeansModel.test(10).idx) // 1
console.log(skmeansModel.test(11).idx) // 1
console.log(skmeansModel.test(1002).idx) // 2
console.log(skmeansModel.test(2002).idx) // 0
Note that skmeansModel.test(<data point>) returns an object containing a key called idx, which is the index of the assigned cluster.
As we can see, our model clusters the given input data perfectly for the chosen K, i.e., the number of clusters. But real-life problems won't be this easy, and the data won't be this simple or this small. One big challenge with this approach is selecting a value of K for our model.
Choosing the value of K -
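K is said to be optimal when increasing it further no longer reduces the variation within the clusters by much.
A simple trick is to start with a small value of K and then keep increasing it. We should stop once the reduction in variation gained from adding another cluster becomes much smaller than the reduction from the previous step.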
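The trick above can be sketched in plain JavaScript as follows. This is a rough illustration under several assumptions rather than a definitive method: all helper names are hypothetical, the small K-means routine uses a deterministic start (evenly spaced picks from the sorted data) so that results are repeatable, variation is measured as the sum of squared distances to the nearest centroid, and the 50% cut-off is an arbitrary choice.

```javascript
// Illustrative elbow heuristic for 1-D data. All helper names and the
// 50% threshold are assumptions made for this sketch.
function nearestIndex(point, centroids) {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (Math.abs(point - centroids[c]) < Math.abs(point - centroids[best])) best = c;
  }
  return best;
}

function fitCentroids(points, k, maxIterations = 100) {
  const sorted = [...points].sort((a, b) => a - b);
  // Deterministic start: evenly spaced picks from the sorted data.
  let centroids = Array.from({ length: k }, (_, i) =>
    sorted[Math.floor(((i + 0.5) * sorted.length) / k)]
  );
  for (let iter = 0; iter < maxIterations; iter++) {
    const labels = points.map((p) => nearestIndex(p, centroids));
    const updated = centroids.map((old, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      return members.length ? members.reduce((s, p) => s + p, 0) / members.length : old;
    });
    const moved = updated.some((c, i) => c !== centroids[i]);
    centroids = updated;
    if (!moved) break; // centroids are stable, so the fit has converged
  }
  return centroids;
}

// Variation: sum of squared distances from each point to its nearest centroid.
function withinClusterVariation(points, centroids) {
  return points.reduce(
    (sum, p) => sum + (p - centroids[nearestIndex(p, centroids)]) ** 2,
    0
  );
}

// Increase K until one more cluster removes less than minRelativeGain
// of the remaining variation, then return the previous K.
function chooseK(points, maxK, minRelativeGain = 0.5) {
  let previous = withinClusterVariation(points, fitCentroids(points, 1));
  for (let k = 2; k <= maxK; k++) {
    const current = withinClusterVariation(points, fitCentroids(points, k));
    if (previous === 0 || (previous - current) / previous < minRelativeGain) return k - 1;
    previous = current;
  }
  return maxK;
}

const elbowData = [0, 1, 2, 3, 4, 1000, 1001, 1002, 1003, 1004, 2000, 2001, 2002, 2003, 2004];
console.log(chooseK(elbowData, 5)); // 3
```

For this input the variation drops sharply up to K = 3 and only slightly afterwards, so the sketch settles on 3. On real data the right threshold depends on the dataset and is worth checking by plotting the variation against K.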
If we perform K-means clustering correctly with suitable parameters, we arrive at a set of centroids that can be used to divide our data points into K clusters with suitably low variance.
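As a small illustration of that last step, here is one way a set of centroids could be used to divide data points into clusters. The helper groupByNearestCentroid is hypothetical and is not part of skmeans, and the centroid values are assumed for the example.

```javascript
// Hypothetical helper: split points into one bucket per centroid,
// placing each point with the centroid it is closest to.
function groupByNearestCentroid(points, centroids) {
  const clusters = centroids.map(() => []);
  for (const p of points) {
    let best = 0;
    for (let c = 1; c < centroids.length; c++) {
      if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
    }
    clusters[best].push(p);
  }
  return clusters;
}

// Centroid values assumed for illustration.
const grouped = groupByNearestCentroid([3, 1001, 2004, 10, 1500, 999], [2, 1002, 2002]);
console.log(grouped); // [ [ 3, 10 ], [ 1001, 1500, 999 ], [ 2004 ] ]
```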