Beginner's Guide to K-Nearest Neighbors in R: From Zero to Hero (with Code and Links)

Author: Leihua Ye, UC Santa Barbara

Translator: 陳超

Proofreader: 馮羽

This article is about 2,300 characters long; suggested reading time is 10 minutes.

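This article introduces a beginner-friendly implementation of the K-nearest neighbors algorithm in R.

It presents one way of building a KNN model in R, together with several evaluation metrics.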

Photo by Mathyas Kurmann on Unsplash

"If you live five minutes away from Bill Gates, I bet you are rich."

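Background

In the world of machine learning, I find the K-nearest neighbors (KNN) classifier the most intuitive and the easiest to pick up, without even introducing any mathematical notation.

To decide the label of an observation, we look at its neighbors and assign their label to the observation of interest. Of course, looking at a single neighbor may introduce bias and error, so the KNN method lays out a set of rules and procedures for determining the best number of neighbors, for example examining k > 1 neighbors and adopting majority rule to decide the class.

"To decide the label for a new observation, we look at its closest neighbors."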
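The majority-vote idea can be sketched on a few hand-made points. The data and variable names below are hypothetical and only illustrate the mechanics; `knn()` comes from the `class` package, which is also used later in this article.

```r
library(class)  # ships with standard R distributions

# Toy training data (hypothetical): two numeric features, two classes
train_x <- data.frame(x1 = c(1, 1.2, 0.8, 5, 5.2, 4.8),
                      x2 = c(1, 0.9, 1.1, 5, 5.1, 4.9))
train_y <- factor(c("no", "no", "no", "yes", "yes", "yes"))

# A new observation that sits close to the "yes" cluster
new_x <- data.frame(x1 = 4.5, x2 = 4.7)

# k = 3: the three closest training points vote, and the majority label wins
pred <- knn(train = train_x, test = new_x, cl = train_y, k = 3)
print(pred)
```

All three nearest neighbors of the new point carry the label "yes", so that is the label it receives.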

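Measure of Distance

To pick the nearest neighbors, we have to define what distance means. For categorical data, there are the Hamming distance and the edit distance. For details, see https://en.m.wikipedia.org/wiki/Knearest_neighbors_algorithm; this post will not go deep into the math.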
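As a rough illustration of the two distances named above, here is a minimal sketch in base R. The helper `hamming()` and the example vectors are made up for this illustration; `adist()` is the edit-distance function from the `utils` package.

```r
# Hamming distance: number of positions at which two equal-length vectors differ
hamming <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a != b)
}

# Two hypothetical observations described by four categorical attributes
obs1 <- c("married", "admin.", "yes", "cellular")
obs2 <- c("single",  "admin.", "no",  "cellular")
d_hamming <- hamming(obs1, obs2)   # differs in positions 1 and 3

# Edit (Levenshtein) distance between two strings
d_edit <- adist("kitten", "sitting")[1, 1]

c(hamming = d_hamming, edit = d_edit)
```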

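What Is K-Fold Cross-Validation?

In machine learning, cross-validation (CV) plays a key role in model selection and has a wide range of applications. The idea behind it is simple and intuitive.

In brief: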

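1. Divide the data into K equally sized chunks/folds

2. Choose one chunk/fold as the test set and the remaining K-1 chunks/folds as the training set

3. Develop an ML model based on the training set

4. Compare predicted values with true values on the test set only

5. Repeat the procedure K times, each time applying the ML model to a different chunk as the test set

6. Add up the model's metric scores and average over the K folds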
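The steps above can be sketched in a few lines of R. This is only an illustration under assumptions that are not part of the article: it uses the built-in `iris` data, 5 folds, and a KNN classifier with 5 neighbors.

```r
library(class)

set.seed(1)
K <- 5                       # number of folds (an arbitrary choice for this sketch)
X <- iris[, 1:4]
y <- iris$Species
n <- nrow(X)

# Step 1: assign each row at random to one of K equally sized folds
fold_id <- sample(rep(1:K, length.out = n))

fold_error <- numeric(K)
for (f in 1:K) {
  is_test <- fold_id == f                        # step 2: fold f is the test set
  pred <- knn(train = X[!is_test, ], test = X[is_test, ],
              cl = y[!is_test], k = 5)           # steps 3 and 5: fit and predict
  fold_error[f] <- mean(pred != y[is_test])      # step 4: compare on the test fold only
}

cv_error <- mean(fold_error)                     # step 6: average over the K folds
cv_error
```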

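How to Choose K?

As you have probably noticed, the tricky part of cross-validation is how to set the value of K. Let n denote the total sample size. Technically, K can be set to any value between 1 and n.

If K = n, we hold out 1 observation as the test set and use the remaining n-1 observations as the training set, then repeat the process across the whole dataset. This is called leave-one-out cross-validation (LOOCV).

LOOCV demands a lot of computing power, and on a very large dataset it may take impractically long to finish.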
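For KNN specifically, the `class` package offers `knn.cv()`, which performs leave-one-out cross-validation directly. The sketch below assumes the built-in `iris` data and 5 neighbors, neither of which comes from the article.

```r
library(class)

X <- iris[, 1:4]
y <- iris$Species

# knn.cv() predicts each row from all the other rows, i.e. leave-one-out CV
set.seed(1)  # ties between classes are broken at random
loo_pred <- knn.cv(train = X, cl = y, k = 5)

# Share of rows whose held-out prediction differs from the true label
loo_error <- mean(loo_pred != y)
loo_error
```

On 150 rows this is instant; the cost grows quickly with n because every observation needs its own neighbor search.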

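Taking a step back, even if there is no single best value of K, we cannot argue that a larger K is always better.

To choose the most suitable K, we have to trade off bias against variance. If K is small, the estimate of the test error has higher bias but lower variance; if K is large, the bias is lower and the variance is higher.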

Photo by Jon Tyson on Unsplash

"Hello neighbor! Come on in."

Implementation in R

1. Software Preparation

# install.packages("ISLR")
# install.packages("ggplot2")
# install.packages("reshape2")
# install.packages("plyr")
# install.packages("dplyr")
# install.packages("class")

# Load libraries
library(ISLR)
library(ggplot2)
library(reshape2)
library(plyr)
library(dplyr)
library(class)

# load data and clean the dataset
banking = read.csv("bank-additional-full.csv", sep = ";", header = TRUE)

# check for missing data and make sure no missing data
banking[!complete.cases(banking), ]

# re-code qualitative (factor) variables into numeric

# The "'old'=new;..." specification used below is the syntax of recode() in the car package
# (install.packages("car") if needed), so the calls name car::recode() explicitly;
# dplyr::recode() does not accept this form.
banking$job = car::recode(banking$job, "'admin.'=1;'blue-collar'=2;'entrepreneur'=3;'housemaid'=4;'management'=5;'retired'=6;'self-employed'=7;'services'=8;'student'=9;'technician'=10;'unemployed'=11;'unknown'=12")

# recode variable again
banking$marital = car::recode(banking$marital, "'divorced'=1;'married'=2;'single'=3;'unknown'=4")
banking$education = car::recode(banking$education, "'basic.4y'=1;'basic.6y'=2;'basic.9y'=3;'high.school'=4;'illiterate'=5;'professional.course'=6;'university.degree'=7;'unknown'=8")
banking$default = car::recode(banking$default, "'no'=1;'yes'=2;'unknown'=3")
banking$housing = car::recode(banking$housing, "'no'=1;'yes'=2;'unknown'=3")
banking$loan = car::recode(banking$loan, "'no'=1;'yes'=2;'unknown'=3")
# recode contact from its own column (the original line mistakenly read banking$loan)
banking$contact = car::recode(banking$contact, "'cellular'=1;'telephone'=2")
banking$month = car::recode(banking$month, "'mar'=1;'apr'=2;'may'=3;'jun'=4;'jul'=5;'aug'=6;'sep'=7;'oct'=8;'nov'=9;'dec'=10")
banking$day_of_week = car::recode(banking$day_of_week, "'mon'=1;'tue'=2;'wed'=3;'thu'=4;'fri'=5")
banking$poutcome = car::recode(banking$poutcome, "'failure'=1;'nonexistent'=2;'success'=3")

# remove variable "pdays", b/c it has no variation
banking$pdays = NULL

# remove variable "duration", b/c it is collinear with the DV
banking$duration = NULL

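After loading and cleaning the raw dataset, a common practice is to visualize the distribution of the variables, checking for seasonality, patterns, outliers, and so on.

# EDA of the DV
# as.factor() keeps this a bar chart even when read.csv() returns character columns (R >= 4.0)
plot(as.factor(banking$y), main="Plot 1: Distribution of Dependent Variable")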

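As the plot shows, the outcome variable (subscription to the bank service) is not evenly distributed: there are far more "no"s than "yes"s.

This is not good news for supervised learning when we try to classify the labels correctly. As expected, if many minority ("yes") cases are classified as the majority label ("no"), the false-negative rate will be high, taking "yes" as the positive class.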
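A quick way to quantify such an imbalance is to tabulate the class shares and compare any model against a classifier that always predicts the majority label. The outcome vector below is hypothetical (90 "no" and 10 "yes"), not the banking data.

```r
# Hypothetical outcome vector: 90 "no" and 10 "yes"
y <- factor(c(rep("no", 90), rep("yes", 10)))

table(y)               # raw counts per class
prop.table(table(y))   # class shares

# Error rate of a "classifier" that always predicts the majority label
majority <- names(which.max(table(y)))
baseline_error <- mean(y != majority)
baseline_error
```

Here the baseline is wrong only 10% of the time while missing every "yes" case, which is why the error rate alone can look deceptively good on unbalanced data.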

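In fact, an unbalanced distribution may favor non-parametric ML classifiers. In another post of mine (Classifying Rare Events Using Five Machine Learning Techniques, https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fclassifying-rare-events-using-five-machine-learning-techniques-fab464573233), KNN performed better than the other ML methods it was compared with. This may be due to the underlying mathematical and statistical assumptions of parametric versus non-parametric models.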

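2. Split the Data

As mentioned above, we need to split the dataset into a training set and a test set, and use k-fold cross-validation to select the best ML model. As a rule of thumb, we stick to the "80-20" split: we train ML models on 80% of the data and test them on the remaining 20%. Time-series data is slightly different; there we change the ratio to 90% versus 10%.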
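For the time-series case, one common way to realize a 90/10 split is to cut along the time axis instead of sampling at random, so the test set contains only the most recent observations. The article does not say how its 90/10 split is drawn, so the chronological cut and the toy data frame `ts_data` below are assumptions.

```r
# Hypothetical time-ordered data: 100 daily observations
ts_data <- data.frame(day = 1:100, value = cumsum(rep(1, 100)))

n <- nrow(ts_data)
cutoff <- floor(0.9 * n)                 # last row of the training period

ts_train <- ts_data[1:cutoff, ]          # earliest 90% for training
ts_test  <- ts_data[(cutoff + 1):n, ]    # most recent 10% for testing

c(train = nrow(ts_train), test = nrow(ts_test))
```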

# split the dataset into training and test sets randomly, but we need to set seed so as to generate the same value each time we run the code
set.seed(1)

# create an index to split the data: 80% training and 20% test
index = round(nrow(banking)*0.2, digits=0)

# sample randomly throughout the dataset and keep the total number equal to the value of index
test.indices = sample(1:nrow(banking), index)

# 80% training set
banking.train = banking[-test.indices,]

# 20% test set
banking.test = banking[test.indices,]

# Select the training set except the DV
YTrain = banking.train$y
XTrain = banking.train %>% select(-y)

# Select the test set except the DV
YTest = banking.test$y
XTest = banking.test %>% select(-y)

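So far, we have finished data preparation and can move on to model selection.

3. Train Models

Let's write a new function ("calc_error_rate") to record the misclassification rate. The function computes the share of cases in which the predicted label does not match the true outcome label. It is the complement of classification accuracy.

# define an error rate function and apply it to obtain test/training errors
calc_error_rate <- function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}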
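A small usage example makes the function's behavior concrete. The function is repeated here so the snippet runs on its own; the two label vectors are hypothetical.

```r
calc_error_rate <- function(predicted.value, true.value) {
  mean(true.value != predicted.value)
}

# Hypothetical predicted and true labels for five cases
predicted <- c("no", "no",  "yes", "no", "yes")
truth     <- c("no", "yes", "yes", "no", "no")

# Labels differ in cases 2 and 5, so 2 of 5 are misclassified
err <- calc_error_rate(predicted, truth)
err
```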

Then we need another function, "do.chunk()", to do k-fold cross-validation. For a given fold, it returns a data frame with the fold ID and the training and validation errors. The main purpose of this step is to select the best number of neighbors for KNN.

nfold = 10
set.seed(1)

# cut() divides the range into several intervals
folds = seq.int(nrow(banking.train)) %>%
  cut(breaks = nfold, labels = FALSE) %>%
  sample

do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){
  train = (folddef != chunkid)  # training index
  Xtr = Xdat[train,]            # training set by the index
  Ytr = Ydat[train]             # true label in training set
  Xvl = Xdat[!train,]           # test set
  Yvl = Ydat[!train]            # true label in test set
  predYtr = knn(train = Xtr, test = Xtr, cl = Ytr, k = k)  # predict training labels
  predYvl = knn(train = Xtr, test = Xvl, cl = Ytr, k = k)  # predict test labels
  data.frame(fold = chunkid,                                # k folds
             train.error = calc_error_rate(predYtr, Ytr),  # training error per fold
             val.error = calc_error_rate(predYvl, Yvl))    # test error per fold
}

# set error.folds to save validation errors

error.folds = NULL

# create a sequence of data with an interval of 10
kvec = c(1, seq(10, 50, length.out = 5))

set.seed(1)
for (j in kvec){
  tmp = ldply(1:nfold, do.chunk,  # apply do.chunk() to each fold
              folddef = folds, Xdat = XTrain, Ydat = YTrain, k = j)  # required arguments
  tmp$neighbors = j                       # track each value of neighbors
  error.folds = rbind(error.folds, tmp)   # combine the results
}

# melt() in the package reshape2 melts wide-format data into long-format data
errors = melt(error.folds, id.vars = c("fold", "neighbors"), value.name = "error")

The next step is to find the number of neighbors that minimizes the validation error.

val.error.means = errors %>%
  # select all rows of validation errors
  filter(variable == "val.error") %>%
  # group the selected data by neighbors
  group_by(neighbors, variable) %>%
  # calculate CV error for each k
  summarise(error = mean(error)) %>%
  # remove existing grouping
  ungroup() %>%
  filter(error == min(error))

# the best number of neighbors
numneighbor = max(val.error.means$neighbors)
numneighbor
## [1] 20

After 10-fold cross-validation, the optimal number of neighbors is 20.
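The same selection logic can be written in base R, which may help when reading the pipeline above. The per-fold errors below are invented for illustration and do not reproduce the article's result.

```r
# Hypothetical validation errors: three folds for each candidate number of neighbors
cv <- data.frame(
  neighbors = rep(c(1, 10, 20), each = 3),
  val.error = c(0.14, 0.15, 0.13,
                0.10, 0.11, 0.09,
                0.12, 0.13, 0.11)
)

# Average the validation error over folds for each candidate
mean_err <- aggregate(val.error ~ neighbors, data = cv, FUN = mean)

# Keep the candidate with the smallest mean validation error
best_k <- mean_err$neighbors[which.min(mean_err$val.error)]
best_k
```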

Nick Youngson

4. Some Model Metrics

#training error

set.seed(20)

pred.YTtrain = knn(train=XTrain, test=XTrain, cl=YTrain, k=20)

knn_traing_error<-calc_error_rate (predicted.value=pred.YTtrain,true.value=YTrain)

knn_traing_error

[1] 0.101214

The training error is 0.10.

#test error

set.seed(20)

pred.YTest = knn(train=XTrain, test=XTest, cl=YTrain, k=20)

knn_test_error <- calc_error_rate(predicted.value=pred.YTest, true.value=YTest)

knn_test_error

[1] 0.1100995

The test error is 0.11.

# confusion matrix
conf.matrix = table(predicted=pred.YTest, true=YTest)

Based on the confusion matrix above, we can compute the following values and get ready to plot the ROC curve.

Accuracy = (TP +TN)/(TP+FP+FN+TN)

TPR/Recall/Sensitivity = TP/(TP+FN)

Precision = TP/(TP+FP)

Specificity = TN/(TN+FP)

FPR = 1 - Specificity = FP/(TN+FP)

F1 Score = 2*TP/(2*TP+FP+FN) = 2*Precision*Recall/(Precision+Recall)
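To see the formulas in action, here is a minimal sketch with made-up counts. The values of TP, FP, FN and TN are hypothetical and are not taken from the article's confusion matrix.

```r
# Hypothetical confusion-matrix counts, with "yes" as the positive class
TP <- 40; FP <- 10; FN <- 20; TN <- 130

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
recall      <- TP / (TP + FN)          # TPR / sensitivity
precision   <- TP / (TP + FP)
specificity <- TN / (TN + FP)
fpr         <- 1 - specificity

# F1 computed two ways: from counts, and as the harmonic mean of precision and recall
f1_counts <- 2 * TP / (2 * TP + FP + FN)
f1_pr     <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, recall = recall, precision = precision,
        specificity = specificity, fpr = fpr, f1 = f1_counts), 3)
```

Both F1 expressions give the same number, which is a handy sanity check when coding the metrics by hand.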

# Test accuracy rate
sum(diag(conf.matrix)/sum(conf.matrix))
[1] 0.8899005

# Test error rate
1 - sum(diag(conf.matrix)/sum(conf.matrix))
[1] 0.1100995

As you may notice, test accuracy rate + test error rate = 1, and I have provided more than one way of calculating each value.

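# ROC and AUC
library(ROCR)  # provides prediction() and performance()

knn_model = knn(train=XTrain, test=XTrain, cl=YTrain, k=20, prob=TRUE)
prob <- attr(knn_model, "prob")  # vote share of the winning class

# NOTE: the labels in this dataset are "no"/"yes", so the comparison with "-1" is never TRUE
# here and every score takes the 1 - prob branch; adapt the label if you reuse this code.
prob <- 2*ifelse(knn_model == "-1", prob, 1-prob) - 1
pred_knn <- prediction(prob, YTrain)
performance_knn <- performance(pred_knn, "tpr", "fpr")

# AUC
auc_knn <- performance(pred_knn, "auc")@y.values
auc_knn
[1] 0.8470583

plot(performance_knn, col=2, lwd=2, main="ROC Curves for KNN")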
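As a cross-check on what AUC measures, it can also be computed in base R from ranks: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The scores and labels below are hypothetical, and this rank-based shortcut is an alternative to the ROCR call, not the article's method.

```r
# Hypothetical scores (higher = more likely "yes") and true labels
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
truth <- c("yes", "yes", "no", "yes", "no", "no", "yes", "no")

pos   <- truth == "yes"
n_pos <- sum(pos)
n_neg <- sum(!pos)

# Rank-sum (Mann-Whitney) form of the AUC
auc <- (sum(rank(score)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc
```

Of the 16 positive-negative pairs in this toy data, the positive case has the higher score in 12, which gives 0.75.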

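To sum up, we have learned what KNN is and built a KNN model in R. More importantly, we have learned the mechanics behind K-fold cross-validation and how to implement it in R.

About the author:

Leihua Ye (@leihua_ye) is a Ph.D. researcher at the University of California, Santa Barbara. He has more than 5 years of research and professional experience in quantitative UX research, experimentation and causal inference, machine learning, and data science.

Original title:

Beginner’s Guide to K-Nearest Neighbors in R: from Zero to Hero

Original link:

https://www.kdnuggets.com/2020/01/beginners-guide-nearest-neighbors-r.html

Editor: 於騰凱

Proofreader: 譚佳瑤

About the translator

陳超 is a master's student in applied psychology at Peking University. After an undergraduate stint in computer science and then a persistent pursuit of psychology, I have come to see data analysis and programming as two essential survival skills, so I make every effort in daily life to engage with them and learn more. The road ahead is long, and I am still on my way.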
