
12. Research on Optimization of the K-means Clustering Algorithm Based on the Hadoop Platform

Date: 2016-10-14 20:28:56

Lu Sheng-yu1,2, Wang Jing-yu1, Zhang Xiao-lin1, Gao Jun-feng1

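(1. School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia Autonomous Region 014010; 2. Shenhua Wuhai Energy Xilaifeng Coal Chemical Company, Wuhai, Inner Mongolia Autonomous Region 016031)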

Keywords: Hadoop; Canopy algorithm; clustering algorithm

Chinese Library Classification number: TP391 Document code: A

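Abstract: The traditional serial K-means clustering algorithm suffers from poor performance and sensitivity to the initial cluster centers when processing massive data. To address these problems, a parallel CK-means clustering algorithm based on the Hadoop platform is proposed. The algorithm uses the Canopy algorithm and a cosine similarity measure to reduce the blindness of K-means in determining the initial cluster centers, and it uses a parallel computing framework to extend the algorithm so that it can handle massive data. Experiments show that the Hadoop-based parallel CK-means algorithm yields better clustering quality and has good speedup and scalability when processing massive data.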

Optimization Algorithm of K-means Clustering Based on Hadoop

Lu Sheng-yu1,2, Wang Jing-yu1, Zhang Xiao-lin1, Gao Jun-feng1

(1. School of Information Engineering, Inner Mongolia University of Science & Technology, Baotou, Inner Mongolia, China; 2. Shenhua WuHai Energy XiLaiFeng Coal Chemical Co., Ltd., Wuhai, Inner Mongolia, China)

Key words: Hadoop; Canopy algorithm; clustering algorithm

Abstract: The traditional serial K-means clustering algorithm cannot meet performance requirements when processing big data, and its results are sensitive to the choice of initial cluster centers. To improve the performance of K-means, a parallel CK-means algorithm based on the Hadoop platform is proposed. The algorithm uses the Canopy algorithm and a cosine distance metric to reduce the blindness of K-means when selecting the initial cluster centers. A parallel computing framework is applied to the algorithm so that it can process big data. Test results show that the Hadoop-based parallel CK-means algorithm achieves better clustering quality, speedup, and scalability when processing big data.
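The abstract does not give the details of CK-means, so the following is only a minimal serial sketch of the general idea: a Canopy pass with a cosine distance metric produces candidate initial centers (and therefore a value of k) that could then be handed to K-means. The function names, the thresholds `t1` and `t2`, and the choice of the canopy mean as the initial center are all assumptions made for illustration, not the authors' implementation, and the Hadoop parallelization is not shown.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def canopy_clusters(points, t1, t2):
    """Group points into canopies using cosine distance.

    t1 is the loose threshold and t2 the tight threshold (t1 > t2).
    Both values are hypothetical tuning parameters.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # take the next unassigned point as a canopy center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = cosine_distance(center, p)
            if d < t1:
                members.append(p)          # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: may seed or join another canopy
        remaining = still_remaining
        canopies.append(members)
    return canopies


def initial_centers(points, t1, t2):
    """Return one candidate initial center per canopy (the canopy mean)."""
    centers = []
    for members in canopy_clusters(points, t1, t2):
        dim = len(members[0])
        centers.append(tuple(sum(m[i] for m in members) / len(members) for i in range(dim)))
    return centers


if __name__ == "__main__":
    data = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
    for c in initial_centers(data, t1=0.3, t2=0.1):
        print(c)
```

The number of canopies found gives a data-driven choice of k, and the returned centers replace the random initialization that makes plain K-means sensitive to its starting point.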