K-means clustering is a machine learning algorithm that divides data into clusters. Here, K is the number of clusters into which we want to divide the data. In this edition, we, at Oodles, an experiential Artificial Intelligence Development Company, provide a step-by-step guide to implementing K-means clustering in Node.js for larger unlabeled data.

*Picture Source: en.wikipedia.org/wiki/K-means_clustering*

1. Randomly select K data points to be the initial cluster centers. Then iterate over all the data points and assign each one to the nearest center.
2. After assigning every data point to a cluster, calculate the mean of the data points in each cluster, move that cluster's center to the mean, and reassign every data point to its nearest center.
3. Repeat Step 2 until recomputing the means no longer changes the cluster of any data point.
4. This alone won't guarantee the expected results, because the outcome depends on the random starting points. For better results, run Steps 1 to 3 several times and keep the clustering with the lowest total variation within clusters.
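
The steps above can be sketched in plain JavaScript for one-dimensional data. This is a minimal illustration, not the code of any library: `kMeans1D` and `totalVariation` are hypothetical helper names, and the starting centers are passed in by hand (instead of being picked at random) so that the run is reproducible.

```javascript
// Minimal 1-D k-means sketch (hypothetical helper, not part of any library).
function kMeans1D(points, initialCentroids, maxIterations = 100) {
  let centroids = initialCentroids.slice();
  let assignments = new Array(points.length).fill(-1);

  for (let iteration = 0; iteration < maxIterations; iteration++) {
    // Assignment step: attach each point to its nearest centroid.
    const next = points.map((p) => {
      let best = 0;
      for (let c = 1; c < centroids.length; c++) {
        if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
      }
      return best;
    });

    // Step 3: stop once no point changes cluster.
    const changed = next.some((cluster, i) => cluster !== assignments[i]);
    assignments = next;
    if (!changed) break;

    // Step 2: move each centroid to the mean of its members.
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => assignments[i] === c);
      if (members.length === 0) return old; // an empty cluster keeps its centroid
      return members.reduce((sum, p) => sum + p, 0) / members.length;
    });
  }
  return { centroids, assignments };
}

// Sum of squared distances from each point to its own centroid.
function totalVariation(points, { centroids, assignments }) {
  return points.reduce((sum, p, i) => sum + (p - centroids[assignments[i]]) ** 2, 0);
}

const data = [0, 1, 2, 3, 4, 1000, 1001, 1002, 1003, 1004, 2000, 2001, 2002, 2003, 2004];

// Two hand-picked starts stand in for the random choice of Step 1.
const good = kMeans1D(data, [0, 1002, 2004]);
const poor = kMeans1D(data, [0, 1, 2]);
console.log(good.centroids); // [ 2, 1002, 2002 ]
console.log(poor.centroids); // [ 0.5, 3, 1502 ]

// Step 4: among several runs, keep the one with the lowest total variation.
const best = [good, poor].reduce((a, b) =>
  totalVariation(data, a) <= totalVariation(data, b) ? a : b
);
console.log(best.centroids); // [ 2, 1002, 2002 ]
```

The second start converges too, but to a poor grouping that merges the two larger groups. That is why Step 4 repeats the whole procedure and compares the results.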

K-means clustering is an effective machine learning algorithm for segmenting large unlabeled datasets, so that predictive analytics services can then be applied to each segment.

Let's understand it better with the help of an example.

First, we require the *skmeans* npm package, which we use to implement the K-means clustering algorithm.

    const skmeans = require("skmeans");

Let's define our input:

    const input_data = [0, 1, 2, 3, 4, 1000, 1001, 1002, 1003, 1004, 2000, 2001, 2002, 2003, 2004];

Let's create a model using this input data:

    const skmeansModel = skmeans(input_data, 3, 'kmpp', 1000000);

Note that the *skmeans* function accepts the following parameters:

- The input data
- The total number of clusters (the value of K)
- [Optional] The algorithm for selecting the initial cluster centers, such as 'kmpp' for k-means++
- [Optional] An upper bound on the number of iterations
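
To give an idea of what a k-means++ style initialization does, here is a rough sketch of the general technique for one-dimensional data. It is not the source code of *skmeans*, whose 'kmpp' option may be implemented differently; `kmeansPlusPlusSeeds` and its injectable `random` parameter are names assumed for this illustration.

```javascript
// Sketch of k-means++ seeding for 1-D data (hypothetical helper).
// `random` can be replaced by a seeded generator to reproduce a run.
function kmeansPlusPlusSeeds(points, k, random = Math.random) {
  // The first centroid is drawn uniformly from the data.
  const centroids = [points[Math.floor(random() * points.length)]];

  while (centroids.length < k) {
    // Squared distance from each point to its nearest chosen centroid.
    const weights = points.map((p) =>
      Math.min(...centroids.map((c) => (p - c) ** 2))
    );
    const total = weights.reduce((sum, w) => sum + w, 0);

    // Draw the next centroid with probability proportional to that
    // squared distance, so far-away points are favoured.
    let threshold = random() * total;
    let chosen = points.length - 1;
    for (let i = 0; i < points.length; i++) {
      threshold -= weights[i];
      if (threshold < 0) {
        chosen = i;
        break;
      }
    }
    centroids.push(points[chosen]);
  }
  return centroids;
}

const samplePoints = [0, 1, 2, 3, 4, 1000, 1001, 1002, 1003, 1004, 2000, 2001, 2002, 2003, 2004];
// The output varies per run; typically one seed lands in each group of values.
console.log(kmeansPlusPlusSeeds(samplePoints, 3));
```

Because already chosen points get a weight of zero, the seeds are spread out instead of landing next to each other, which makes a poor starting configuration less likely than with a purely uniform pick.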

Let's test the model:

    // Cluster numbering depends on the random initialization,
    // so the exact indices can differ from run to run.
    console.log(skmeansModel.test(10).idx);   // 1
    console.log(skmeansModel.test(11).idx);   // 1
    console.log(skmeansModel.test(1002).idx); // 2
    console.log(skmeansModel.test(2002).idx); // 0

Note that *skmeansModel.test(<data point>)* returns an object containing a key called *idx*, which is the index of the assigned cluster.
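
Conceptually, testing a new data point means finding the centroid closest to it. The sketch below illustrates that idea in plain JavaScript. `nearestCentroid` is a hypothetical helper and its return shape is chosen for this example; it is not the implementation used by *skmeans*.

```javascript
// Hypothetical helper: find the centroid closest to a 1-D point.
function nearestCentroid(point, centroids) {
  let idx = 0;
  for (let i = 1; i < centroids.length; i++) {
    if (Math.abs(point - centroids[i]) < Math.abs(point - centroids[idx])) idx = i;
  }
  return { idx, centroid: centroids[idx] };
}

// The same three centroids, stored in two different orders.
const orderA = [2, 1002, 2002];
const orderB = [2002, 2, 1002];

console.log(nearestCentroid(10, orderA)); // { idx: 0, centroid: 2 }
console.log(nearestCentroid(10, orderB)); // { idx: 1, centroid: 2 }
```

The index is only a label: it changes with the order in which the centroids happen to be stored, while the matched centroid stays the same. What matters is that nearby points such as 10 and 11 receive the same index.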

As we can see, our model works perfectly with the given input data and K, i.e., the number of clusters. But real-life problems won't be this easy, and the data won't be this simple or this small. One big challenge with this approach is selecting a value of K for our model.

Choosing the value of K:

K is said to be optimal when,

- the total variation within each cluster is low (though not the lowest possible, which is reached only when every data point is its own cluster), and
- K itself is not too big

A simple trick is to start with a small value of K and increase it step by step. Keep increasing K while each increase still reduces the variation substantially, and stop once an extra cluster yields a much smaller reduction than the previous increases did.
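
This trick can be tried out on the small input used earlier. The sketch below is self-contained and does not use *skmeans*: `clusterVariation` is a hypothetical helper containing a compact one-dimensional k-means, and its evenly spaced starting centers are an assumption made only to keep the numbers reproducible.

```javascript
// Compact 1-D k-means that returns the total within-cluster variation for a given K.
function clusterVariation(points, k, maxIterations = 100) {
  const sorted = [...points].sort((a, b) => a - b);
  // Deterministic start (assumption): the middle of k equal slices of the sorted data.
  let centroids = Array.from({ length: k }, (_, i) =>
    sorted[Math.floor(((i + 0.5) * sorted.length) / k)]
  );
  // Index of the centroid closest to p (the first one wins a tie).
  const nearest = (p) =>
    centroids.reduce(
      (best, c, i) => (Math.abs(p - c) < Math.abs(p - centroids[best]) ? i : best),
      0
    );

  for (let iteration = 0; iteration < maxIterations; iteration++) {
    const labels = sorted.map(nearest);
    // Move each centroid to the mean of its members.
    const updated = centroids.map((old, c) => {
      const members = sorted.filter((_, i) => labels[i] === c);
      return members.length
        ? members.reduce((sum, p) => sum + p, 0) / members.length
        : old;
    });
    const converged = updated.every((c, i) => c === centroids[i]);
    centroids = updated;
    if (converged) break;
  }
  // Sum of squared distances from every point to its nearest centroid.
  return sorted.reduce((sum, p) => sum + (p - centroids[nearest(p)]) ** 2, 0);
}

const values = [0, 1, 2, 3, 4, 1000, 1001, 1002, 1003, 1004, 2000, 2001, 2002, 2003, 2004];
const variations = [1, 2, 3, 4, 5].map((k) => clusterVariation(values, k));
console.log(variations); // [ 10000030, 2500030, 30, 22.5, 15 ]
```

Going from K = 2 to K = 3 removes almost all of the variation, while going from 3 to 4 barely changes it. That sharp bend suggests K = 3 for this data.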

If we perform K-means clustering correctly with suitable parameters, we land on a set of centroids that can be used to divide our data points into K clusters with suitably low variance.