1. The k-nearest-neighbor model

k-nearest neighbor (k-NN) is a basic algorithm for both classification and regression. It has three key elements: the choice of k, the distance metric, and the classification decision rule.

The k-NN model: let $N_k(x)$ denote the neighborhood consisting of the $k$ training points nearest to $x$. The predicted class is

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$
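For example, if the $k = 3$ nearest neighbors of $x$ carry labels $c_1, c_2, c_1$, then $\sum I(y_i = c_1) = 2$ and $\sum I(y_i = c_2) = 1$, so the prediction is $y = c_1$.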

1) Choice of k

A small k gives a more complex model with a smaller approximation error, but it is prone to overfitting;

a large k gives a simpler model with a larger approximation error.

In practice, k is usually set to a fairly small value, and the best k can be chosen by cross-validation. k is also usually taken to be odd, to avoid ties between classes.
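As a minimal sketch of choosing k by cross-validation (using the iris data as a stand-in; any of the datasets loaded later would work the same way):

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

# evaluate odd values of k and keep the one with the best mean CV accuracy
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```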

2) Distance metric

The Euclidean distance is usually used, or, more generally, the $L_p$ distance:

$$L_p(x_i, x_j) = \left( \sum_{l} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p}$$

where $l$ indexes the feature dimensions.
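A minimal sketch of this distance for two feature vectors (the vectors are made up for illustration; $p = 1$ gives the Manhattan distance, $p = 2$ the Euclidean distance):

```python
import numpy as np

def lp_distance(xi, xj, p=2):
    """L_p (Minkowski) distance between two feature vectors."""
    return float((np.abs(xi - xj) ** p).sum() ** (1 / p))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(lp_distance(a, b, p=1))  # 7.0  (Manhattan)
print(lp_distance(a, b, p=2))  # 5.0  (Euclidean)
```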

3) Decision rule

Majority vote: the predicted class is the class held by the majority of the k nearest points.
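A one-line sketch of the majority vote over the neighbor labels (the labels are hypothetical):

```python
from collections import Counter

neighbor_labels = [1, 0, 1]  # labels of the k = 3 nearest neighbors (hypothetical)
print(Counter(neighbor_labels).most_common(1)[0][0])  # predicts class 1
```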

2. The kd-tree

A kd-tree (k-dimensional tree) is a data structure that partitions a k-dimensional data space.

Building a kd-tree:

1. At the current splitting dimension, take the sample at the median as the parent node, and partition the remaining data into the left and right subtrees;

2. Move on to the next dimension and repeat step 1 on the left and right subtrees (a small worked sketch follows this list).
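As a minimal, self-contained sketch of those two steps (the 2-D point set below is purely an illustrative example):

```python
def build(points, axis=0):
    """Recursively split on the median along the current axis, then advance the axis."""
    if len(points) == 0:
        return None
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    next_axis = (axis + 1) % len(points[0])
    return {"point": points[mid],
            "left": build(points[:mid], next_axis),
            "right": build(points[mid + 1:], next_axis)}

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["point"])           # (7, 2): the median of the x-coordinates becomes the root
print(tree["left"]["point"])   # (5, 4): the median of the y-coordinates on the left side
print(tree["right"]["point"])  # (9, 6)
```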

Nearest-neighbor search:

1. Find the query point's "best leaf node" in the kd-tree.

Concretely, start from the root and, at each node, go left or right according to the query point's value in that node's splitting dimension, until a leaf node is reached.

2. Backtrack upward.

Concretely, start from the "best leaf node"; if the current node is closer to the query point than the best found so far, it becomes the current best, and its sibling node is also checked;

the search ends once the root has been processed.
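The same two operations (build, then query with backtracking) are what a library kd-tree exposes; a minimal sketch with sklearn's KDTree (random data, purely for illustration):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

tree = KDTree(X)                    # build the tree over the 3-D points
dist, ind = tree.query(X[:1], k=3)  # k-nearest-neighbor search for the first point
print(ind, dist)                    # the query point itself comes back at distance 0
```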

3. Practice

3.1 Brute force

With a brute-force implementation, for $N$ samples of $D$-dimensional data the complexity is $O(DN^2)$: computing the distances for a single query point costs $O(DN)$ and selecting its $k$ nearest neighbors costs $O(N)$, so handling on the order of $N$ query points costs $O(DN^2)$ in total.

```python
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split

## Example 1: iris for classification (3 classes)
# X, y = datasets.load_iris(return_X_y=True)

# Example 2
X, y = datasets.load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# my k-NN
class KNN():
    def __init__(self, data, K=3):
        self.X = data[0]
        self.y = data[1]
        self.K = K

    def fit(self, X_test):
        # pairwise differences: every test point against every training point
        diffX = np.repeat(X_test, len(self.X), axis=0) - np.tile(self.X, (len(X_test), 1))
        square_diffX = (diffX ** 2).sum(axis=1).reshape(len(X_test), len(self.X))
        sorted_index = square_diffX.argsort()
        predict = [0 for _ in range(len(X_test))]
        for i in range(len(X_test)):
            class_count = {}
            # majority vote among the K nearest training points
            for j in range(self.K):
                vote_label = self.y[sorted_index[i][j]]
                class_count[vote_label] = class_count.get(vote_label, 0) + 1
            sorted_count = sorted(class_count.items(), key=lambda x: x[1], reverse=True)
            predict[i] = sorted_count[0][0]
        return predict

    def predict(self, X_test):
        return self.fit(X_test)

    def score(self, X, y):
        y_pred = self.predict(X)
        return 1 - np.count_nonzero(y - y_pred) / len(y)

knn = KNN((X_train, y_train), K=3)
print(knn.score(X_test, y_test))
```

Output: 0.9415204678362573

Using the sklearn API:

```python
# sklearn API
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
print(neigh.score(X_test, y_test))
```

Output: 0.9415204678362573

As long as a k-NN implementation uses the same choice of k and the same definition of distance, the results are exactly the same.

3.2 kd-tree

Now define the kd-tree. Time complexity: $O[D N \log N]$.

```python
# kd-tree
import math

class KDNode:
    '''
    value: [X, y]
    '''
    def __init__(self, value=None, parent=None, left=None, right=None, index=None):
        self.value = value
        self.parent = parent
        self.left = left
        self.right = right

    @property
    def brother(self):
        if not self.parent:
            bro = None
        else:
            if self.parent.left is self:
                bro = self.parent.right
            else:
                bro = self.parent.left
        return bro


class KDTree():
    def __init__(self, K=3):
        self.root = KDNode()
        self.K = K

    def _build(self, data, axis=0, parent=None):
        '''
        data: [X, y]
        '''
        # choose the median point along the current axis
        if len(data) == 0:
            root = KDNode()
            return root
        data = np.array(sorted(data, key=lambda x: x[axis]))
        median = int(len(data) / 2)
        loc = data[median]
        root = KDNode(loc, parent=parent)
        new_axis = (axis + 1) % (len(data[0]) - 1)
        if len(data[:median, :]) == 0:
            root.left = None
        else:
            root.left = self._build(data[:median, :], axis=new_axis, parent=root)
        if len(data[median + 1:, :]) == 0:
            root.right = None
        else:
            root.right = self._build(data[median + 1:, :], axis=new_axis, parent=root)
        self.root = root
        return root

    def fit(self, X, y):
        # concat X, y so each node stores its label in the last column
        data = np.concatenate([X, y.reshape(-1, 1)], axis=1)
        root = self._build(data)
        return root

    def _get_eu_distance(self, arr1: np.ndarray, arr2: np.ndarray) -> float:
        return ((arr1 - arr2) ** 2).sum() ** 0.5

    def _search_node(self, current, point, result={}, class_count={}):
        # Get max_node, max_distance.
        if not result:
            max_node = None
            max_distance = float('inf')
        else:
            # the farthest (node, distance) pair among the current candidates
            max_node, max_distance = sorted(result.items(), key=lambda n: n[1], reverse=True)[0]
        node_dist = self._get_eu_distance(current.value[:-1], point)
        if len(result) == self.K:
            if node_dist < max_distance:
                result.pop(max_node)
                result[current] = node_dist
                class_count[current.value[-1]] = class_count.get(current.value[-1], 0) + 1
                class_count[max_node.value[-1]] = class_count.get(max_node.value[-1], 1) - 1
        elif len(result) < self.K:
            result[current] = node_dist
            class_count[current.value[-1]] = class_count.get(current.value[-1], 0) + 1
        return result, class_count

    def search(self, point):
        # find the leaf node (current) that the query point falls into.
        current = self.root
        axis = 0
        while current:
            if point[axis] < current.value[axis]:
                prev = current
                current = current.left
            else:
                prev = current
                current = current.right
            axis = (axis + 1) % len(point)
        current = prev

        # search the k nearest points while backtracking to the root
        result = {}
        class_count = {}
        while current:
            result, class_count = self._search_node(current, point, result, class_count)
            if current.brother:
                result, class_count = self._search_node(current.brother, point, result, class_count)
            current = current.parent
        return result, class_count

    def predict(self, X_test):
        predict = [0 for _ in range(len(X_test))]
        for i in range(len(X_test)):
            _, class_count = self.search(X_test[i])
            sorted_count = sorted(class_count.items(), key=lambda x: x[1], reverse=True)
            predict[i] = sorted_count[0][0]
        return predict

    def score(self, X, y):
        y_pred = self.predict(X)
        return 1 - np.count_nonzero(y - y_pred) / len(y)

    def print_tree(self, X_train, y_train):
        height = int(math.log(len(X_train)) / math.log(2))
        max_width = pow(2, height)
        node_width = 2
        in_level = 1
        root = self.fit(X_train, y_train)

        from collections import deque
        q = deque()
        q.append(root)
        while q:
            count = len(q)
            width = int(max_width * node_width / in_level)
            in_level += 1
            while count > 0:
                node = q.popleft()
                if node.left:
                    q.append(node.left)
                if node.right:
                    q.append(node.right)
                node_str = (str(node.value) if node else '').center(width)
                print(node_str, end=' ')
                count -= 1
            print("\n")


kd = KDTree()
kd.fit(X_train, y_train)
print(kd.score(X_test, y_test))
```

Output: 0.9590643274853801

Q1: Why does this result differ from the brute-force result above?

A1: It is probably the upward backtracking step. Here the distance to every node's sibling is computed, but when the sibling becomes the current best node the search does not continue downward into it. The correct behaviour should be: if the current node becomes the current best, also check its sibling, and if the sibling becomes the best, keep searching downward inside that sibling's subtree (my guess).
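For reference, the textbook backtracking rule (this is the standard procedure, not a fix applied to the class above) only descends into the far subtree when the splitting hyperplane is closer to the query point than the current k-th best distance. A minimal sketch, reusing the small dict-based `build` sketch from section 2:

```python
import heapq

def knn_search(node, query, k, axis=0, heap=None):
    """Textbook kd-tree k-NN search with plane-distance pruning (sketch)."""
    if heap is None:
        heap = []                                   # max-heap of (-distance, point)
    if node is None:
        return heap
    point = node["point"]
    dist = sum((p - q) ** 2 for p, q in zip(point, query)) ** 0.5
    if len(heap) < k:
        heapq.heappush(heap, (-dist, point))
    elif dist < -heap[0][0]:
        heapq.heapreplace(heap, (-dist, point))
    # visit the near side first, i.e. the side of the splitting plane holding the query
    near, far = ((node["left"], node["right"]) if query[axis] < point[axis]
                 else (node["right"], node["left"]))
    next_axis = (axis + 1) % len(point)
    knn_search(near, query, k, next_axis, heap)
    # only descend into the far side if the plane is closer than the k-th best distance
    if len(heap) < k or abs(query[axis] - point[axis]) < -heap[0][0]:
        knn_search(far, query, k, next_axis, heap)
    return heap

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print([p for _, p in knn_search(tree, (6, 3), k=2)])  # the two nearest neighbors of (6, 3)
```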

Comparing the running efficiency of the kd-tree and the brute-force method

Using a custom sample set:

```python
# Example 3
X, y = datasets.make_blobs(n_samples=10000, centers=3,
                           n_features=3, random_state=0)
# split the new dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

import time

# brute force
st = time.time()
knn = KNN((X_train, y_train), K=3)
print(knn.score(X_test, y_test))
et = time.time()
print("use", et - st)

# kd tree
st = time.time()
kd = KDTree()
kd.fit(X_train, y_train)
print(kd.score(X_test, y_test))
et = time.time()
print("use", et - st)
```

Output:

```
# brute force
0.9993333333333333
use 2.8174679279327393

# kd tree
0.9973333333333333
use 0.7757344245910645
```

With comparable accuracy, the kd-tree's running time is clearly shorter.
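The same experiment can be repeated with sklearn, whose KNeighborsClassifier lets you force the search strategy through the algorithm parameter; a minimal sketch reusing the Example 3 split above:

```python
import time
from sklearn.neighbors import KNeighborsClassifier

for algorithm in ("brute", "kd_tree"):
    st = time.time()
    clf = KNeighborsClassifier(n_neighbors=3, algorithm=algorithm)
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    print(algorithm, acc, "use", time.time() - st)
```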
