The k-nearest neighbor classifier is one of the introductory supervised classifiers that every data science learner should be aware of. For simplicity, this classifier is often called the KNN classifier. KNN is generally used for pattern recognition problems and is also a strong choice for many classification tasks. The simple version of the k-nearest neighbor classifier algorithm predicts the target label by finding the nearest neighbor class; the closest class is identified using a distance measure such as Euclidean distance.
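To make that idea concrete, here is a minimal from-scratch sketch of the simple version described above (the toy data and function names are illustrative, not from any particular library):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line (Euclidean) distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Sort training points by their distance to the query point.
    neighbors = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], query))
    # Majority vote among the labels of the k nearest neighbors.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Toy data: two features per point, two classes.
train_X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.1, 3.9)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (1.1, 0.9)))  # prints "A"
```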
Before diving into the k-nearest neighbor classification process, let’s understand an application-oriented example where we can use the KNN algorithm.
Let’s assume a money-lending company “XYZ”, like UpStart or IndiaLends. The XYZ company is interested in making the money-lending system comfortable and safe for lenders as well as for borrowers. The company holds a database of customer details.
Using the detailed customer information from the database, it calculates a credit score (a discrete value) for each customer. The calculated credit score helps the company and lenders clearly understand the credibility of a customer, so they can easily decide whether they should lend money to a particular customer or not.
The customer details could include:

Educational background details:
  - Highest degree obtained.
  - Cumulative grade point average (CGPA) or marks percentage.
  - Reputation of the college.
  - Consistency across lower degrees.
  - Whether an education loan was taken or not.
  - Whether education loan dues were cleared.

Employment details:
  - Salary.
  - Years of experience.
  - Any onsite opportunities received.
  - Average job change duration.
The company (XYZ) uses these kinds of details to calculate the credit score of a customer. The process of calculating the credit score from a customer’s details is expensive. To reduce that cost, the company realized that customers with similar background details tend to get similar credit scores.
So, they decided to use the already available customer data and predict the credit score for a new customer by comparing it with similar existing customers. These kinds of problems are handled by the k-nearest neighbor classifier, which finds the most similar customers.
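As a hedged sketch of how this could look in code, here is one way to phrase the credit-score prediction with scikit-learn’s KNeighborsClassifier (the feature columns, numbers, and score bands below are invented for illustration, not the company’s actual data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented stand-ins for the customer attributes listed above:
# each row is [CGPA, years_of_experience, salary].
X_train = np.array([
    [8.5, 5, 60000],
    [6.0, 1, 20000],
    [9.1, 7, 90000],
    [5.5, 0, 15000],
])
# Hypothetical credit-score bands assigned to past customers.
y_train = ["good", "poor", "good", "poor"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Score a new customer by looking at the most similar past customers.
new_customer = np.array([[8.0, 4, 55000]])
print(model.predict(new_customer))  # e.g. ['good']
```

In practice the features should be scaled first (e.g. with scikit-learn’s StandardScaler), since Euclidean distance is dominated by features with large ranges such as salary.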
Let's take another simple example of the KNN classifier. Suppose, on the basis of past data, you have found 15 Hollywood movies that your friend likes. Now a new Hollywood movie, “Star Wars”, is released, and before buying the tickets for your friend, you want to find out whether your friend will like this movie or not. You can identify this with a KNN classifier before buying the tickets. We can easily get the data from the IMDb site and check the Hollywood movies he rated in his top 15.
We can clearly see that the problem here is whether your friend will like the movie or not, so the target class should be ‘yes’ or ‘no’. The training data would be the past movies liked by your friend, and the test data is the newly released movie. The expected features here are genre, actor, actress, and director.
The KNN classifier is a good choice in this situation because k-nearest neighbor runs quickly on small training datasets and is very simple to implement. It can quickly tell us whether your friend will like the movie or not, as the sketch below shows.
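Here is a minimal sketch of that setup, assuming we have already pulled the ratings; the genres, director names, and ‘liked’ labels are made-up placeholders, and the categorical features are one-hot encoded so Euclidean distance is meaningful:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data standing in for the friend's top-rated movies.
past_movies = pd.DataFrame({
    "genre":    ["sci-fi", "drama", "sci-fi", "action", "drama"],
    "director": ["lucas", "nolan", "lucas", "cameron", "nolan"],
    "liked":    ["yes", "no", "yes", "yes", "no"],
})

# One-hot encode the categorical features for distance computation.
X = pd.get_dummies(past_movies[["genre", "director"]])
y = past_movies["liked"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Encode the new movie with exactly the same columns as the training set.
new_movie = pd.DataFrame({"genre": ["sci-fi"], "director": ["lucas"]})
new_X = pd.get_dummies(new_movie).reindex(columns=X.columns, fill_value=0)
print(model.predict(new_X))  # e.g. ['yes']
```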
How to choose the value of K?
Selecting the value of K in k-nearest neighbor is the most critical problem. A small value of K means that noise will have a higher influence on the result, i.e., the probability of overfitting is very high. A large value of K makes it computationally expensive and defeats the basic idea behind KNN (that points that are near are likely to have similar classes). A simple rule of thumb for selecting K is k = √n, where n is the number of training samples.
To optimize the results, we can use cross-validation. Using the cross-validation technique, we can test the KNN algorithm with different values of K; the model that gives the best accuracy can be considered the optimal choice.
It depends on the individual case; at times, the best process is to run through each possible value of K and test the results, as the sketch below illustrates.
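As a minimal sketch of that search, the loop below tries a range of K values with 5-fold cross-validation (scikit-learn’s built-in iris dataset is used purely as stand-in data):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
start_k = int(np.sqrt(len(X)))  # the k = sqrt(n) rule of thumb (~12 here)

# Mean 5-fold cross-validated accuracy for each candidate K.
scores = {}
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"sqrt(n) suggestion: {start_k}, best K by cross-validation: {best_k}")
```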