The parallel canopy algorithm for text clustering based on MapReduce
ZHANG Ya-nan1, TAN Yue-sheng2
(1. Information Engineering School, Inner Mongolia University of Science and Technology, Baotou 014010, China;
2. Engineering and Training Center, Inner Mongolia University of Science and Technology, Baotou 014010, China)
Key words: text clustering; canopy algorithm; Hadoop; MapReduce
CLC number: TP391.1 Document code: A
Abstract: Building on a study of the Hadoop platform and the MapReduce programming framework, this paper presents a parallel canopy algorithm for text clustering based on MapReduce. The canopy algorithm uses two distance thresholds, T1 and T2, to build overlapping subsets, which avoids the sensitivity to noise of traditional clustering algorithms. It also uses a suitable fast approximate distance measure, which greatly speeds up clustering. Experiments show that the algorithm achieves good cluster speedup under the MapReduce framework and is suitable for processing large-scale data sets.
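The two-threshold idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It follows the commonly described canopy formulation, in which T1 > T2: points within T1 of a center join its canopy, and points within T2 stop being candidates for new centers. The names `cheap_distance`, `canopy_centers`, `assign` and `parallel_canopy` are hypothetical, Jaccard distance over token sets is only an assumed stand-in for the fast approximate metric, and the map and reduce phases are simulated with plain lists rather than Hadoop jobs.

```python
from typing import FrozenSet, List, Sequence

Doc = FrozenSet[str]  # a document reduced to its set of tokens (assumption)


def cheap_distance(a: Doc, b: Doc) -> float:
    """Jaccard distance over token sets; an assumed stand-in for the cheap metric."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def canopy_centers(docs: Sequence[Doc], t1: float, t2: float) -> List[Doc]:
    """Pick canopy centers; documents within t2 of a center stop being candidates."""
    if not t1 > t2:
        raise ValueError("expected t1 > t2")
    candidates = list(docs)
    centers: List[Doc] = []
    while candidates:
        center = candidates.pop(0)
        centers.append(center)
        # Keep only documents that are not tightly bound to this center.
        candidates = [d for d in candidates if cheap_distance(center, d) >= t2]
    return centers


def assign(docs: Sequence[Doc], centers: Sequence[Doc], t1: float) -> List[List[int]]:
    """For each document, list the indices of all centers closer than t1.

    A document may fall into several canopies, which gives the overlapping subsets.
    """
    return [
        [i for i, c in enumerate(centers) if cheap_distance(c, d) < t1]
        for d in docs
    ]


def parallel_canopy(splits: Sequence[Sequence[Doc]], t1: float, t2: float) -> List[Doc]:
    """Simulated MapReduce: local centers per split, then one merging pass."""
    # "Map" phase: each input split produces its own local canopy centers.
    local = [c for split in splits for c in canopy_centers(split, t1, t2)]
    # "Reduce" phase: run the same procedure over the collected local centers.
    return canopy_centers(local, t1, t2)
```

In this sketch `canopy_centers` only chooses centers, and `assign` then builds the canopies. A document can appear in more than one canopy whenever it lies within T1 of several centers.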