What are K-Means and KNN algorithms?
K-Means is an unsupervised machine learning algorithm used for clustering problems, whereas KNN is a supervised machine learning algorithm that can be used to solve both classification and regression problems. Both algorithms depend heavily on the value of K, which must be chosen beforehand.
1. K-Means:
As mentioned earlier, K-Means is an unsupervised machine learning algorithm that is used to solve clustering problems. K is the number of clusters to form. The following steps are carried out to group the available unlabelled data points into K clusters.
- Step1: Decide the value of K before proceeding with the algorithm. A suitable K can be found with the Elbow method, which plots the WCSS (within-cluster sum of squares) against the number of clusters. The K at which the curve bends, so that adding more clusters reduces the WCSS only slightly, is chosen.
- Step2: Once the value of K has been decided, initialize K centroids, one per cluster (for example, by picking K data points at random).
- Step3: Assign each data point to its nearest cluster. The nearest cluster is found by calculating the distance between the data point and each cluster centroid. In the figure below, point A will be assigned to the cluster with centroid C, because the distance A-C is smaller than the distance A-D.
The distance is calculated with the Euclidean formula, which for point P(x1,y1) and point Q(x2,y2) is: d(P,Q) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
- Step4: Once all the data points have been assigned to the K clusters, recompute the centroid of each cluster as the mean of the data points assigned to it.
- Step5: Repeat Step3 and Step4 until there is no change in the clusters.
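The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names, the sample points, and the choice to initialize centroids from randomly sampled data points are all assumptions made for the example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels, wcss)."""
    rng = random.Random(seed)
    # Step 2: start from k distinct data points chosen at random
    # (an assumption; other initialization schemes exist).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Step 3: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # WCSS: sum of squared distances from each point to its own centroid.
    wcss = sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, wcss


# Hypothetical sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels, wcss = k_means(points, k=2)
print(labels)

# Step 1 (Elbow method): compute the WCSS for several K and look for the bend.
for k in range(1, 5):
    print(k, round(k_means(points, k)[2], 3))
```

For data like this, the WCSS should drop sharply from K=1 to K=2 and only slightly afterwards, which is the "elbow" that suggests K=2.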
2. KNN:
KNN is an abbreviation for K Nearest Neighbours. It is a supervised machine learning technique used to solve classification and regression problems. K is the number of nearest labelled data points (neighbours) that are considered around the data point being predicted.
To solve classification problems, the distance between the unlabelled data point and each labelled data point is calculated, the K nearest points are selected, and the unlabelled point is assigned to the class that holds the majority among those K points. In the example below, the green data point will be classified into the blue class when K=3, because its three nearest neighbours contain more blue data points than orange ones.
To solve regression problems, the average of the target values of the K nearest data points is used as the output prediction.
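A minimal sketch of KNN classification follows, assuming Euclidean distance and a simple majority vote. The function name `knn_classify` and the blue/orange sample points are hypothetical and chosen only to mirror the example described above.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # Distance from the query to every labelled point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: two blue points and one orange point lie near the query.
train_points = [(1.5, 2.0), (2.0, 2.8), (2.6, 2.0), (5.0, 5.0), (6.0, 5.0)]
train_labels = ["blue", "blue", "orange", "orange", "orange"]

prediction = knn_classify(train_points, train_labels, query=(2.0, 2.0), k=3)
print(prediction)  # blue
```

Orange is the larger class overall here, yet the query is still labelled blue, because only the K nearest points take part in the vote.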
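The regression case can be sketched the same way, again assuming Euclidean distance and an unweighted average. The function name `knn_regress` and the sample values are hypothetical.

```python
import math


def knn_regress(train_points, train_targets, query, k=3):
    """Predict a numeric value as the mean target of the k nearest neighbours."""
    # Sort the labelled points by their distance to the query.
    by_distance = sorted(
        zip(train_points, train_targets),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_targets = [target for _, target in by_distance[:k]]
    return sum(nearest_targets) / len(nearest_targets)


# Hypothetical one-dimensional data; the point at 10.0 is too far away to be used.
train_points = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_targets = [10, 20, 30, 100]

estimate = knn_regress(train_points, train_targets, query=(2.2,), k=3)
print(estimate)  # 20.0
```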
The K nearest data points can be computed by using either of the below two distances