Data Mining: Unsupervised Learning (Clustering)

1. K-means

K-Means is a partition-based clustering algorithm.
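The partitioning idea can be sketched in a few lines of NumPy: alternately assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This is an illustrative re-implementation for intuition, not the sklearn `KMeans` used below; the function name `kmeans` is ours.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means: alternate between assigning points to the
    nearest centroid and moving each centroid to its cluster mean."""
    rng = np.random.default_rng(seed)
    # initialize centroids as k distinct random points drawn from X
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: label each point with its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; real implementations add smarter initialization (k-means++) and handle empty clusters.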

1.1 Generating Random Data of Specified Shapes

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# *************** generate random data of specified shapes *****************
from sklearn.datasets import make_circles, make_moons, make_blobs

n_samples = 1000
# ring-shaped data
# n_samples: number of sample points
# factor: scale factor between the inner and outer circle
circles = make_circles(n_samples=n_samples, factor=0.5, noise=0.05)
# crescent-shaped data
moons = make_moons(n_samples=n_samples, noise=0.05)
# blob-shaped data
# random_state: random seed, keeps the generated data reproducible
# center_box: bounding box for the cluster centers, default (-10, 10)
# cluster_std: standard deviation of each cluster (controls compactness), default 1.0
# centers: number of cluster centers, default 3
blobs = make_blobs(n_samples=n_samples, random_state=100,
                   center_box=(-10, 10), cluster_std=1, centers=3)
# uniformly random data (all points labeled 0)
random_data = np.random.rand(n_samples, 2), np.array([0 for i in range(n_samples)])
datasets = [circles, moons, blobs, random_data]
fig = plt.figure(figsize=(20, 8))
```

1.2 Clustering

```python
colors = "rgbykcm"
for index, data in enumerate(datasets):
    X = data[0]
    Y_old = data[1]
    km_cluster = KMeans(n_clusters=2)
    km_cluster.fit(X)
    Y_new = km_cluster.labels_
    # top row: original labels; bottom row: K-Means labels
    fig.add_subplot(2, len(datasets), index + 1)
    plt.scatter(X[:, 0], X[:, 1], color=[colors[y] for y in Y_old])
    fig.add_subplot(2, len(datasets), index + 5)
    plt.scatter(X[:, 0], X[:, 1], color=[colors[y] for y in Y_new])
```
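The code above fixes n_clusters = 2 for every dataset. A common heuristic for choosing k is the elbow method: run K-Means over a range of k and look for the bend in the inertia curve (`inertia_` is the sum of squared distances of samples to their closest centroid). A minimal sketch on a synthetic three-blob dataset; the center coordinates are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three well-separated synthetic clusters (centers chosen for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.5, random_state=0)

# inertia_ always shrinks as k grows, but the drop flattens past the "true" k
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# plotting range(1, 7) against inertias shows an elbow at k = 3
```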

1.3 Results

(figure: K-Means results on the four datasets, original labels vs. predicted labels)

2. Hierarchical Clustering

2.1 Code

  `AgglomerativeClustering(n_clusters, affinity, linkage)`
  • affinity:
  1. "euclidean": Euclidean distance
  2. "l1", "l2"
  3. "manhattan": Manhattan distance
  4. "cosine": cosine distance
  5. "precomputed": the input must be a precomputed distance matrix
  • linkage: {"ward", "complete", "average", "single"}, default="ward"
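As a sketch of what "precomputed" means: hierarchical clustering can be run directly on a pairwise distance matrix instead of raw features. The example below uses scipy's hierarchy module rather than `AgglomerativeClustering` (scipy's API is stable across versions), with average linkage on a toy dataset:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# two tight groups of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# pdist returns the condensed (upper-triangle) pairwise distance vector;
# squareform(d) gives the full n x n matrix that "precomputed" expects
d = pdist(X, metric="euclidean")

# average linkage on the precomputed distances, then cut into 2 clusters
Z = linkage(d, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Note that in recent scikit-learn releases the `affinity` parameter of `AgglomerativeClustering` has been renamed to `metric`.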
```python
from sklearn.datasets import make_circles, make_blobs, make_moons
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# prepare data
n_samples = int(1e3)
circles = make_circles(n_samples=n_samples, noise=0.05, factor=0.5, random_state=10)
moons = make_moons(n_samples=n_samples, noise=0.05, random_state=10)
blobs = make_blobs(n_samples=n_samples, centers=4, cluster_std=0.1,
                   center_box=(-1, 1), random_state=10)
np.random.seed(10)
random_data = (np.random.rand(n_samples, 2), np.zeros(n_samples, dtype=int))
datasets = [circles, moons, blobs, random_data]
fig = plt.figure(figsize=(20, 8), dpi=72)
colors = "rgbk"
for index, data in enumerate(datasets):
    X = data[0]
    Y = data[1]
    agg_cluster = AgglomerativeClustering(n_clusters=2, affinity="euclidean",
                                          linkage="average")
    Y_predict = agg_cluster.fit(X).labels_
    # top row: original labels; bottom row: predicted labels
    fig.add_subplot(2, len(datasets), index + 1)
    plt.scatter(X[:, 0], X[:, 1], color=[colors[y] for y in Y])
    fig.add_subplot(2, len(datasets), index + 5)
    plt.scatter(X[:, 0], X[:, 1], color=[colors[y] for y in Y_predict])
```

2.2 Results

(figure: hierarchical clustering results on the four datasets)

3. DBSCAN

3.1 Parameter Selection

  1. Radius (eps): the k-distance curve helps choose the radius. Pick a point, compute its distance to every other point, and sort the distances in ascending order; the point where the distance jumps abruptly suggests a good radius. Finding it usually takes a fair amount of experimentation.
  2. MinPts: start with a slightly small value, then tune it over several runs.
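The k-distance heuristic in step 1 can be sketched with sklearn's `NearestNeighbors`: compute each point's distance to its k-th nearest neighbor, sort those distances in ascending order, and look for the elbow of the resulting curve. The k = 20 below mirrors the min_samples = 20 used in the code that follows; the exact value is an assumption to tune:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=10)

k = 20  # candidate MinPts value
nn = NearestNeighbors(n_neighbors=k).fit(X)
# kneighbors(X) includes each point itself as its own 0-distance neighbor,
# so column -1 holds the distance to the (k-1)-th other point
distances, _ = nn.kneighbors(X)        # shape (n_samples, k)
k_dist = np.sort(distances[:, -1])     # ascending k-distance curve

# the elbow of this curve is a reasonable eps, e.g. plt.plot(k_dist)
```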

3.2 Code

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles, make_moons, make_blobs
from sklearn.cluster import DBSCAN

# generate the clustering data
n_samples = 1000
circles = make_circles(n_samples=n_samples, noise=0.05, factor=0.5, random_state=10)
moons = make_moons(n_samples=n_samples, noise=0.05, random_state=10)
blobs = make_blobs(n_samples=n_samples, centers=3, cluster_std=0.1,
                   center_box=(-1, 1), random_state=10)
np.random.seed(10)
random_data = (np.random.rand(n_samples, 2), np.zeros(n_samples, dtype=int))
datasets = [circles, moons, blobs, random_data]
fig = plt.figure(figsize=(20, 8), dpi=72)
colors = "rgbky"
for index, data in enumerate(datasets):
    X = data[0]
    Y_old = data[1]
    dbscan_model = DBSCAN(eps=0.1, min_samples=20)
    dbscan_model.fit(X)
    Y_new = dbscan_model.labels_  # noise points get label -1 (maps to the last color)
    fig.add_subplot(2, len(datasets), index + 1)
    plt.scatter(X[:, 0], X[:, 1], color=[colors[y] for y in Y_old])
    plt.title("original labels")
    fig.add_subplot(2, len(datasets), index + 5)
    plt.scatter(X[:, 0], X[:, 1], color=[colors[y] for y in Y_new])
    plt.title("DBSCAN labels")
```

3.3 Results

(figure: DBSCAN results on the four datasets)

by CyrusMay 2022 04 05


Reposted from: https://blog.csdn.net/Cyrus_May/article/details/123971463
Copyright belongs to the original author, CyrusMay. Please contact us for removal in case of infringement.
