
12. Research on the Optimization of the K-means Clustering Algorithm Based on the Hadoop Platform

Date: 2016-10-14 20:28:56

Lu Shengyu1,2, Wang Jingyu1, Zhang Xiaolin1, Gao Junfeng1

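(1. School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia Autonomous Region 014010; 2. Shenhua WuHai Energy XiLaiFeng Coal Chemical Company, Wuhai, Inner Mongolia Autonomous Region 016031)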

Keywords: Hadoop; Canopy algorithm; clustering algorithm

CLC number: TP391   Document code: A

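Abstract: The traditional serial K-means clustering algorithm suffers from poor performance and from sensitivity to the initial cluster centers when processing massive data. To address these problems, a parallel CK-means clustering algorithm based on the Hadoop platform is proposed. The algorithm uses the Canopy algorithm and a cosine similarity measure to reduce the blindness of K-means in determining the initial cluster centers, and it is parallelized with a parallel computing framework so that it can handle massive data. Experiments show that the Hadoop-based parallel CK-means algorithm achieves better clustering quality and shows good speedup and scalability when processing massive data.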

Optimization Algorithm of K-means Clustering Based on Hadoop

Lu Sheng-yu1,2, Wang Jing-yu1, Zhang Xiao-lin1, Gao Jun-feng1

(1. School of Information Engineering, Inner Mongolia University of Science & Technology, Baotou, Inner Mongolia, China; 2. Shenhua WuHai Energy XiLaiFeng Coal Chemical Co., Ltd., Wuhai, Inner Mongolia, China)

Key words: Hadoop; Canopy algorithm; clustering algorithm

Abstract: The traditional serial K-means clustering algorithm cannot meet performance requirements when processing big data, and its results are sensitive to the choice of initial cluster centers. To improve the performance of K-means, a parallel CK-means algorithm based on the Hadoop platform is proposed. The algorithm uses the Canopy algorithm and a cosine distance metric to reduce the blindness of K-means in finding the initial cluster centers. A parallel computing framework is applied to the algorithm so that it can process big data. The test results show that the parallel CK-means algorithm based on the Hadoop platform achieves better clustering quality, speedup, and scalability when processing big data.
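
The idea summarized in the abstract, Canopy with a cosine measure choosing the initial centers that K-means then refines, can be illustrated with a small serial sketch. This is not the authors' implementation and does not show the Hadoop parallelization. The function names, the `tight_sim` threshold, the sample points, and the use of cosine similarity in the K-means assignment step are all assumptions made for illustration. Only the tight Canopy threshold is modeled, because the loose threshold affects canopy membership and not which points become centers when candidates are visited in order.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def canopy_centers(points, tight_sim):
    """Hypothetical helper: a point becomes a new canopy center unless its
    similarity to an existing center already reaches tight_sim."""
    centers = []
    for p in points:
        if all(cosine_similarity(p, c) < tight_sim for c in centers):
            centers.append(p)
    return centers


def nearest(p, centers):
    """Index of the center most similar to p (assumed cosine assignment)."""
    return max(range(len(centers)), key=lambda i: cosine_similarity(p, centers[i]))


def kmeans(points, initial_centers, iterations=10):
    """Plain K-means loop seeded with the canopy centers instead of random ones."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Illustrative data: two groups of vectors pointing in different directions.
points = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.1, 1.0), (0.2, 0.9), (0.0, 1.0)]
seeds = canopy_centers(points, tight_sim=0.8)  # k is the number of canopies
final_centers, labels = kmeans(points, seeds)
```

In this sketch the number of clusters is not supplied by hand; it falls out of the Canopy pass, which is one way the blind choice of initial centers can be avoided.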

Address: No. 7 Aerding Street, Kundulun District, Baotou, Inner Mongolia   Postal code: 014010   Tel: 0472-5951610 or 0472-5953910   Email: cky@imust.edu.cn nkdxb@imust.edu.cn

Copyright: Editorial Office of the Journal of Inner Mongolia University of Science and Technology (©2013)