[Machine Learning] Watermelon Book (西瓜书, Zhou Zhihua), Exercise 9.4: Implementing the k-means Algorithm, with Plotting (Python)
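1. Core algorithm:
k-means is a prototype-based clustering algorithm. Its objective is to minimize the squared error: it repeatedly assigns each sample to the nearest mean vector and then recomputes the mean vectors, which keeps the within-cluster distances small.
The full procedure is shown in the figure above.
Note: the dataset is attached at the end.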
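To make the objective concrete: the squared error adds up, over every cluster, the squared Euclidean distance from each member to that cluster's mean vector, i.e. E = sum_i sum_{x in C_i} ||x - mu_i||^2. Below is a minimal sketch of that computation in plain Python. The function name `squared_error` and its arguments are my own choices for illustration and are not part of the implementation in section 2; `clusters` holds sample indices, in the same spirit as `self.C` there.

```python
def squared_error(points, clusters, means):
    """Sum of squared Euclidean distances from each point to its cluster mean.

    points:   list of coordinate tuples
    clusters: list of index lists, one per cluster (indices into points)
    means:    list of mean vectors, one per cluster
    """
    total = 0.0
    for members, mu in zip(clusters, means):
        for j in members:
            total += sum((a - b) ** 2 for a, b in zip(points[j], mu))
    return total


# Tiny hand-checkable example: two clusters on a line.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
clusters = [[0, 1], [2]]
means = [(1.0, 0.0), (10.0, 0.0)]
print(squared_error(points, clusters, means))  # 1 + 1 + 0 -> prints 2.0
```

A smaller value means tighter clusters, but the value is only comparable between runs that use the same k.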
2. Implementation:
# -*- coding: utf-8 -*-
#author: w61
import random
import math
import matplotlib.pyplot as plt
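class Kmeans():
    def __init__(self, k, mode):
        self.datasets = []                 # samples, one feature vector per row
        self.vector = 0                    # dimensionality of the dataset
        self.k = k                         # number of clusters
        self.C = [[] for i in range(k)]    # clusters, each a list of sample indices
        self.mode = mode                   # initialization method (0: random samples)
        self.mean_vector = []              # mean vectors
        self.update_flag = 0               # whether any mean vector changed in the last update
        self.count = 0                     # number of rounds completed so far
        self.max_count = 500               # maximum number of rounds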
    def LoadData(self, file_name):
        """
        Load the dataset.
        :param file_name: path to the dataset file
        :return: None
        """
        with open(file_name) as f:
            for contents in f:
                content = contents.split()
                if not content:  # skip blank lines
                    continue
                # drop the leading index column and keep the features as floats
                self.datasets.append([float(v) for v in content[1:]])
        self.vector = len(self.datasets[0])
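    def Inialize(self):
        """
        Initialize the mean vectors.
        :return: None
        """
        if self.mode == 0:
            # pick k distinct samples at random as the initial mean vectors
            means = random.sample(range(0, len(self.datasets)), self.k)
            for mean in means:
                self.mean_vector.append(self.datasets[mean])
        else:
            raise ValueError('unsupported initialization mode: %r' % self.mode)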
    def GetDistance(self, x, y):
        """
        Compute the Euclidean distance between two vectors x and y.
        :param x: first vector
        :param y: second vector
        :return: dis, the distance
        """
        dis = 0
        for i in range(self.vector):
            dis += pow((float(x[i]) - float(y[i])), 2)
        dis = math.sqrt(dis)
        return dis
    def Clustering(self):
        """
        Assign every sample to the cluster with the nearest mean vector.
        :return: None
        """
        self.C = [[] for m in range(self.k)]  # reset the clusters on every round
        for j in range(0, len(self.datasets)):
            # start from infinity so that a true distance of 0 is not mistaken for "unset"
            min_dis = float('inf')
            min_index = 0
            for i in range(self.k):
                dis = self.GetDistance(self.datasets[j], self.mean_vector[i])
                if dis < min_dis:
                    min_dis = dis
                    min_index = i
            self.C[min_index].append(j)
    def Update(self):
        """
        Recompute the mean vector of every cluster.
        :return: None
        """
        self.update_flag = 0
        for i in range(self.k):
            if len(self.C[i]) == 0:
                continue  # empty cluster: keep its previous mean vector
            means = []
            for h in range(self.vector):
                total = 0
                for j in self.C[i]:
                    total += float(self.datasets[j][h])
                means.append(total / len(self.C[i]))
            if means != self.mean_vector[i]:
                self.mean_vector[i] = list(means)
                self.update_flag = 1
    def Plot(self):
        """
        Plot the samples colored by cluster, together with the mean vectors.
        Only the first two dimensions are drawn.
        :return: None
        """
        fig = plt.figure()
        ax = fig.add_subplot(111)
        colors = ['red', 'blue', 'green', 'black', 'pink', 'brown', 'gray']
        for i in range(self.k):
            for j in self.C[i]:
                x = float(self.datasets[j][0])
                y = float(self.datasets[j][1])
                # cycle through the palette so that k > 7 does not raise IndexError
                ax.scatter(x, y, color=colors[i % len(colors)], marker='+')
        # draw each mean vector once
        for mean in self.mean_vector:
            x = float(mean[0])
            y = float(mean[1])
            ax.scatter(x, y, color='cyan', marker='p')
        plt.show()
    def Process(self):
        """
        Run the whole pipeline: load, initialize, iterate, report, plot.
        :return: None
        """
        self.LoadData('data.txt')
        self.Inialize()
        while self.count < self.max_count:
            self.Clustering()
            self.Update()
            if self.update_flag == 0:  # no mean vector changed: converged
                break
            self.count += 1
        print('k-means clustering result:')
        print(self.C)
        self.Plot()
if __name__ == "__main__":
    kmeans = Kmeans(5, 0)
    kmeans.Process()
Every function is commented in some detail this time, so I won't explain much more.
The results look like this:
(1) 5 clusters:
(2) 3 clusters:
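For readers who want to compare the two settings without the plots, here is a compact, self-contained sketch that runs k-means for k = 3 and k = 5 on the same 30 samples and prints the cluster sizes and the squared error. It is a rewrite for illustration, not the class from section 2: the names `run_kmeans`, `sq_dist`, `DATA` and the `seed` parameter are my own assumptions, and because the initial means are random, a different seed can give a different partition.

```python
import random

# The 30 samples from data.txt in the appendix (index column dropped).
DATA = [
    (0.697, 0.460), (0.774, 0.376), (0.634, 0.264), (0.608, 0.318), (0.556, 0.215),
    (0.403, 0.237), (0.481, 0.149), (0.437, 0.211), (0.666, 0.091), (0.243, 0.267),
    (0.245, 0.057), (0.343, 0.099), (0.639, 0.161), (0.657, 0.198), (0.360, 0.370),
    (0.593, 0.042), (0.719, 0.103), (0.359, 0.188), (0.339, 0.241), (0.282, 0.257),
    (0.748, 0.232), (0.714, 0.346), (0.483, 0.312), (0.478, 0.437), (0.525, 0.369),
    (0.751, 0.489), (0.532, 0.472), (0.473, 0.376), (0.725, 0.445), (0.446, 0.459),
]


def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def run_kmeans(points, k, seed=0, max_rounds=500):
    """Plain k-means; returns (means, clusters), clusters being lists of indices."""
    rng = random.Random(seed)  # hypothetical seed argument, for repeatable runs
    means = [points[i] for i in rng.sample(range(len(points)), k)]
    clusters = [[] for _ in range(k)]
    dims = len(points[0])
    for _ in range(max_rounds):
        # assignment step: each sample goes to the nearest mean
        clusters = [[] for _ in range(k)]
        for j, p in enumerate(points):
            nearest = min(range(k), key=lambda i: sq_dist(p, means[i]))
            clusters[nearest].append(j)
        # update step: recompute means; an empty cluster keeps its old mean
        new_means = [
            tuple(sum(points[j][d] for j in c) / len(c) for d in range(dims))
            if c else means[i]
            for i, c in enumerate(clusters)
        ]
        if new_means == means:  # nothing moved: converged
            break
        means = new_means
    return means, clusters


for k in (3, 5):
    means, clusters = run_kmeans(DATA, k, seed=0)
    err = sum(sq_dist(DATA[j], means[i]) for i, c in enumerate(clusters) for j in c)
    print(k, [len(c) for c in clusters], round(err, 4))
```

With the class from section 2, the corresponding change would presumably be constructing `Kmeans(3,0)` instead of `Kmeans(5,0)` in the main block.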
Appendix:
The data.txt dataset:
1 0.697 0.460
2 0.774 0.376
3 0.634 0.264
4 0.608 0.318
5 0.556 0.215
6 0.403 0.237
7 0.481 0.149
8 0.437 0.211
9 0.666 0.091
10 0.243 0.267
11 0.245 0.057
12 0.343 0.099
13 0.639 0.161
14 0.657 0.198
15 0.360 0.370
16 0.593 0.042
17 0.719 0.103
18 0.359 0.188
19 0.339 0.241
20 0.282 0.257
21 0.748 0.232
22 0.714 0.346
23 0.483 0.312
24 0.478 0.437
25 0.525 0.369
26 0.751 0.489
27 0.532 0.472
28 0.473 0.376
29 0.725 0.445
30 0.446 0.459
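Each row above is an index followed by two feature values, separated by single spaces. As a small sketch of how such rows can be parsed independently of the class, the helper below (the name `load_points` is hypothetical) drops the index column, converts the rest to floats and skips blank lines.

```python
def load_points(lines):
    """Parse 'index x y' rows into a list of float vectors, skipping blank lines."""
    points = []
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if not fields:
            continue
        points.append([float(v) for v in fields[1:]])  # drop the index column
    return points


sample = ["1 0.697 0.460\n", "2 0.774 0.376\n", "\n"]
print(load_points(sample))  # prints [[0.697, 0.46], [0.774, 0.376]]
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example inside `with open('data.txt') as f:`.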