K-means clustering: a use case in the security domain
Clustering in Machine Learning
Clustering is a type of unsupervised learning. An unsupervised learning method draws inferences from datasets that consist of input data without labeled responses. It is generally used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing a population of data points into groups such that points in the same group are more similar to each other than to points in other groups. In other words, it groups objects according to the similarity and dissimilarity between them.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:
- Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not at all. For example, if a retail store segments its customers into 10 groups, each customer is put into exactly one of those 10 groups.
- Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, each point is assigned a probability or likelihood of belonging to each cluster. In the same retail store scenario, each customer is assigned a probability of belonging to each of the 10 clusters.
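The difference between the two types can be sketched in a few lines of Python. This is an illustration only: the centers and the customer point are made-up values, the function names `hard_assign` and `soft_assign` are hypothetical, and the softmax over negative squared distances is just one assumed way of turning distances into membership probabilities.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_assign(point, centers):
    """Hard clustering: return the index of the single nearest center."""
    distances = [squared_distance(point, c) for c in centers]
    return distances.index(min(distances))

def soft_assign(point, centers):
    """Soft clustering: return one membership probability per center.

    Assumption: probabilities come from a softmax over negative squared
    distances. This is one illustrative choice among many.
    """
    distances = [squared_distance(point, c) for c in centers]
    smallest = min(distances)
    # Shift by the smallest distance before exponentiating for numerical stability.
    weights = [math.exp(-(d - smallest)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical cluster centers and one hypothetical customer.
centers = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
customer = (4.0, 4.0)

hard_label = hard_assign(customer, centers)
memberships = soft_assign(customer, centers)
print(hard_label)    # index of the nearest center
print(memberships)   # one probability per center, summing to 1
```

The hard assignment returns a single cluster index, while the soft assignment returns a full probability vector whose largest entry points to the same cluster.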
What is k-means clustering?
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation there is within a cluster, the more homogeneous (similar) its data points are.
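The iterative procedure described above can be sketched as follows. This is a minimal illustration of the standard assign-then-update loop (often called Lloyd's algorithm), not a production implementation. The function name `kmeans`, the random initialisation from data points, the `max_iters` and `seed` parameters, and the sample points are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iters=100, seed=0):
    """Return (centroids, labels, sum of squared distances)."""
    rng = random.Random(seed)
    # Assumed initialisation: pick k data points at random as starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    # The quantity K-means tries to minimise: within-cluster sum of squares.
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels, inertia = kmeans(points, k=2)
print(labels)
print(centroids)
print(inertia)
```

Because the result depends on the starting centroids, practical implementations usually run the algorithm several times with different initialisations and keep the run with the lowest sum of squared distances.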
A K-means clustering based security framework for mobile data mining
An agent that belongs to an aggregator module gathers raw data from the designated internet sources. At this point, an initial structure is given to the unprocessed data before it is stored in an appropriate database.
Preprocessing and filtering
All the data sets collected from the various internet sources are converted to a common format. The system must then separate the interesting documents from the uninteresting ones. To accomplish this, a scoring system is applied to each piece of data obtained. The documents that pass are inserted into a data warehouse, stored in a format that enables further processing.
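The source does not say how the scoring system works, so the following is only a sketch under stated assumptions: documents are scored by summing hypothetical keyword weights, and anything below a hypothetical threshold is discarded. The names `KEYWORD_WEIGHTS`, `SCORE_THRESHOLD`, `normalize`, `score`, and `filter_interesting` are invented for this example.

```python
# Hypothetical keyword weights; a real system would tune or learn these.
KEYWORD_WEIGHTS = {
    "malware": 3.0,
    "exploit": 3.0,
    "phishing": 2.0,
    "breach": 2.0,
    "password": 1.0,
}
SCORE_THRESHOLD = 3.0  # hypothetical cut-off between interesting and not

def normalize(raw_text, source):
    """Bring a raw document into a common format (a plain dict here)."""
    return {"source": source, "text": " ".join(raw_text.lower().split())}

def score(document):
    """Sum the weights of known keywords, ignoring trailing punctuation."""
    words = document["text"].split()
    return sum(KEYWORD_WEIGHTS.get(w.strip(".,;:!?"), 0.0) for w in words)

def filter_interesting(documents):
    """Keep only documents whose score reaches the threshold."""
    kept = []
    for doc in documents:
        s = score(doc)
        if s >= SCORE_THRESHOLD:
            kept.append({**doc, "score": s})
    return kept

# Made-up input documents from two made-up sources.
documents = [
    normalize("New  Exploit targets password managers.", "forum"),
    normalize("Weekly recipe roundup", "blog"),
]
warehouse = filter_interesting(documents)
print(warehouse)
```

Only the first document reaches the threshold here, so it is the only one that would go on to the data warehouse.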
Analytics and alerting
In this step, several data mining operations are run against the collected data sets to uncover previously undetected patterns. Depending on predetermined criteria, the respective module must be able to notify and alert the appropriate authority of any security incident in progress.
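One way to connect this step to K-means is to treat points that lie far from every cluster centroid as suspicious. The source does not specify the alerting criteria, so the rule below is an assumption. The centroids, events, and `max_distance` threshold are made-up values, and `notify` is a hypothetical stand-in for a real notification channel.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

def find_alerts(events, centroids, max_distance):
    """Return (event_id, distance) for events far from every centroid.

    Assumed criterion: an event is suspicious when its distance to the
    nearest centroid exceeds max_distance.
    """
    alerts = []
    for event_id, features in events:
        d = nearest_centroid_distance(features, centroids)
        if d > max_distance:
            alerts.append((event_id, d))
    return alerts

def notify(alerts):
    """Hypothetical stand-in for email, pager, or SIEM integration."""
    return [
        f"ALERT {event_id}: {d:.2f} from nearest cluster"
        for event_id, d in alerts
    ]

# Hypothetical centroids, as if produced by an earlier K-means run.
centroids = [(1.0, 1.0), (8.0, 8.0)]
# Hypothetical events as (identifier, feature vector) pairs.
events = [
    ("e1", (1.2, 0.9)),
    ("e2", (8.1, 7.7)),
    ("e3", (4.5, 15.0)),
]
alerts = find_alerts(events, centroids, max_distance=2.0)
for message in notify(alerts):
    print(message)
```

In this made-up data only `e3` lies more than the threshold away from both centroids, so it is the only event that triggers a notification.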