K-means clustering in the security domain

Jul 26, 2021

Every machine learning engineer wants accurate predictions from their algorithms. Learning algorithms are generally divided into two types: supervised and unsupervised. K-means clustering is an unsupervised algorithm, meaning the available input data has no labeled response.

What is meant by the K-means algorithm?

K-Means clustering is an unsupervised learning algorithm: unlike supervised learning, it works without labeled data. It divides objects into clusters so that objects in the same cluster are similar to one another and dissimilar to objects in other clusters.

The term ‘K’ is a number: you tell the system how many clusters to create. For example, K = 2 means two clusters. There are also ways to find the best, or optimum, value of K for a given dataset.

For a better understanding of K-Means, let’s take an example from cricket. Imagine you receive data on many cricket players from all over the world, giving the runs each player scored and the wickets each took in the last ten matches. Based on this information, we need to group the players into two clusters: batsmen and bowlers.

Applications of K-Means Clustering

K-Means clustering is used in a variety of real-life applications and business cases, such as:

Academic performance
Diagnostic systems
Search engines
Wireless sensor networks

How Does K-Means Clustering Work?

The goal of the K-Means algorithm is to find clusters in the given input data. There are a couple of ways to choose the number of clusters. One is trial and error: we specify a value of K (e.g., 3, 4, 5) and keep changing it until we get the best clusters.

Another method is to use the Elbow technique to determine the value of K: run the clustering for a range of K values and pick the point where adding more clusters stops reducing the within-cluster variation by much. Once K is chosen, the system places that many centroids at random and measures the distance from each data point to each centroid. It then assigns each data point to the centroid closest to it. This gives us K initial clusters.
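The assignment step can be sketched in a few lines of Python. This is only an illustration: the player figures are made up, the two starting centroids are hand-picked rather than random so that the result is reproducible, and the helper name nearest_centroid is hypothetical.

```python
import math

# Hypothetical (runs, wickets) totals over the last ten matches.
players = [
    (420, 1), (390, 0), (455, 2),   # high runs, few wickets
    (60, 18), (45, 22), (80, 15),   # few runs, many wickets
]

# Two starting centroids, hand-picked here for reproducibility;
# the article describes placing them randomly.
centroids = [(400, 5), (100, 15)]


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Each player is assigned to the centroid at minimum distance.
assignments = [nearest_centroid(p, centroids) for p in players]
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Note that runs are on a much larger scale than wickets, so they dominate the Euclidean distance here. In practice the features are often rescaled before clustering.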

For each newly formed cluster, the algorithm calculates a new centroid as the mean position of the cluster’s points, so the centroid moves away from its randomly allocated starting position. The distance of each point is then measured from the new centroids. If required, data points are reassigned to a closer centroid, and the centroids are recalculated once again. As long as any centroid moves, the algorithm has not converged and the iteration continues. Once the centroids stop moving, the clustering process has converged and the result is returned.
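Putting the steps together, the whole loop might look like the sketch below. It is a minimal pure-Python version under stated assumptions, not a reference implementation: the player data is made up, the function names kmeans and wcss are hypothetical, and the initial centroids are drawn from the data points themselves with a fixed seed. The final loop prints the within-cluster sum of squares for several values of K, which is the quantity usually inspected for the Elbow technique.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Converged once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


def wcss(points, centroids, assignments):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(
        math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments)
    )


# Hypothetical (runs, wickets) totals over the last ten matches.
players = [
    (420, 1), (390, 0), (455, 2),
    (60, 18), (45, 22), (80, 15),
]

centroids, assignments = kmeans(players, k=2)
print(centroids, assignments)

# Compare the within-cluster sum of squares across several K values.
for k in range(1, 5):
    c, a = kmeans(players, k)
    print(k, round(wcss(players, c, a), 1))
```

On this toy data the three high-scoring players end up in one cluster and the three wicket-takers in the other. Which cluster gets label 0 depends on the random starting centroids.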

Cyber Profiling using K-Means Clustering

Cyber Profiling :-

The idea of cyber profiling is derived from criminal profiling, which provides the investigation division with information to classify the types of criminals who were at the crime scene. Profiling is more specifically based on what is known and not known about the criminal. A profile is information about an individual or group of individuals that is accumulated, stored, and used for various purposes, for example by monitoring their behavior through their internet activity. The difficulty in implementing cyber profiling lies in the diversity of user data and in the fact that online behavior sometimes differs from actual behavior. Given the uniqueness of personal behavior, inductive generalizations can be very reliable but can also lead to a misunderstanding in behavior analysis. Therefore, the cyber-profiling process combines deductive and inductive methods. For investigations, cyber profiling makes a good contribution to the field of forensic computer science. Cyber profiling is one of the ways an investigator gets to know alleged offenders, through the analysis of data patterns that include aspects of technology, investigation, psychology, and sociology.

The cyber-profiling process can be directed toward:

~ Identification of users of computers that have been used previously.

~ Mapping of the subject’s family, social life, work, or network-based organizations, including those for whom he/she worked.

~ Provision of information about the user regarding his/her ability, level of threat, and how vulnerable he/she is to threats.

~ Identification of the suspected abuser.

Written by RajSaundatikar