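Lu Sheng-yu1,2, Wang Jing-yu1, Zhang Xiao-lin1, Gao Jun-feng1
(1. School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, Inner Mongolia Autonomous Region; 2. Shenhua WuHai Energy XiLaiFeng Coal Chemical Co., Ltd., Wuhai 016031, Inner Mongolia Autonomous Region)
Keywords: Hadoop; Canopy algorithm; clustering algorithm
CLC number: TP391 Document code: A
Abstract: The traditional serial K-means clustering algorithm suffers from poor performance and sensitivity to the initial cluster centers when processing massive data. To address these problems, a parallel CK-means clustering algorithm based on the Hadoop platform is proposed. The algorithm uses the Canopy algorithm together with a cosine similarity measure to reduce the blindness of K-means in determining the initial cluster centers, and it is extended with a parallel computing framework so that it can handle massive data. Experiments show that the CK-means parallel algorithm on the Hadoop platform yields better clustering quality and achieves good speedup and scalability when processing massive data.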
Optimization Algorithm of K-means Clustering Based on Hadoop
Lu Sheng-yu1,2, Wang Jing-yu1, Zhang Xiao-lin1, Gao Jun-feng1
(1. School of Information Engineering, Inner Mongolia University of Science & Technology, Baotou, Inner Mongolia, China; 2. Shenhua WuHai Energy XiLaiFeng Coal Chemical Co., Ltd., Wuhai, Inner Mongolia, China)
Key words: Hadoop; Canopy algorithm; clustering algorithm
Abstract: The traditional serial K-means clustering algorithm cannot meet performance requirements when processing big data, and it is sensitive to the choice of initial cluster centers. To address these problems, a parallel CK-means algorithm based on the Hadoop platform is proposed. The algorithm uses the Canopy algorithm with a cosine distance metric to reduce the blindness of K-means in selecting the initial cluster centers, and a parallel computing framework is applied to the algorithm so that it can process big data. The test results show that the CK-means parallel algorithm based on the Hadoop platform has better clustering quality, as well as good speedup and scalability, when processing big data.
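The idea summarized in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation. The function names, the thresholds `t1` and `t2`, the sample points, the use of each canopy's first point as a seed, and the arithmetic-mean center update are all assumptions made for illustration. A Canopy pass under cosine distance produces the initial centers (and therefore the number of clusters), and a standard K-means loop then refines them.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def canopy(points, t1, t2):
    """Canopy pass with loose threshold t1 and tight threshold t2 (t1 > t2).

    Returns a list of (center, members) pairs. The thresholds are
    hypothetical tuning parameters, not values from the paper.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Points within t1 of the center belong to this canopy.
        members = [center] + [p for p in remaining
                              if cosine_distance(center, p) <= t1]
        # Points within t2 can no longer start a canopy of their own.
        remaining = [p for p in remaining if cosine_distance(center, p) > t2]
        canopies.append((center, members))
    return canopies

def mean(vectors):
    """Component-wise arithmetic mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(points, centers, max_iter=20):
    """K-means with cosine distance, seeded with the given centers.

    Assumes max_iter >= 1. Using the arithmetic mean as the new center
    is a simplification chosen for this sketch.
    """
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: cosine_distance(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # assignments are stable
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data: two groups pointing in roughly orthogonal directions.
points = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0),
          (0.1, 1.0), (0.2, 0.9), (0.0, 1.0)]
seeds = [center for center, _ in canopy(points, t1=0.5, t2=0.3)]
centers, clusters = kmeans(points, seeds)
```

In a MapReduce setting, one plausible mapping is for the assignment loop to run in the map phase and the per-cluster mean to be computed in the reduce phase, but the abstract does not specify how the authors divide the work.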