
19. A parallel canopy text clustering algorithm based on MapReduce*

Date: 2013-09-15 13:00:00


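ZHANG Ya-nan1, TAN Yue-sheng2

(1. School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, Inner Mongolia, China;

2. Engineering Training Center, Inner Mongolia University of Science and Technology, Baotou 014010, Inner Mongolia, China)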

Keywords: text clustering; canopy algorithm; Hadoop; MapReduce

CLC number: TP391.1  Document code: A

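Abstract: Building on a study of the Hadoop platform and the MapReduce programming framework, this paper proposes a parallel canopy text clustering algorithm based on MapReduce. The canopy algorithm uses two distance thresholds, T1 and T2, to build overlapping subsets, which avoids the sensitivity to noise that affects traditional clustering algorithms. It also adopts a suitable fast approximate distance measure, which greatly speeds up clustering. Experiments show that the algorithm achieves good cluster speedup under the MapReduce framework and is suitable for processing large-scale data sets.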

The parallel canopy algorithm for text clustering based on MapReduce

ZHANG Ya-nan1, TAN Yue-sheng2

(1. Information Engineering School, Inner Mongolia University of Science and Technology, Baotou 014010, China;

2. Engineering and Training Center, Inner Mongolia University of Science and Technology, Baotou 014010, China)

Key words: document clustering; canopy algorithm; Hadoop; MapReduce

Abstract: Based on a study of the Hadoop platform and the MapReduce programming framework, a canopy algorithm for text clustering based on MapReduce is presented. The algorithm uses two distance thresholds, T1 and T2, to build overlapping subsets, which avoids the sensitivity to noise of traditional clustering algorithms. At the same time, the algorithm uses an appropriate fast approximate distance metric, which greatly accelerates clustering. Experiments show that it achieves good speedup under the MapReduce framework, so the algorithm is suitable for handling large data sets.
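The abstract does not give the algorithm's details, so the following is only a minimal sketch of how canopy clustering with two thresholds can be split into a map step and a reduce step. It assumes T1 > T2, uses Manhattan distance as a stand-in for the fast approximate metric, and represents documents as plain numeric tuples (in practice these would be term-weight vectors). The function names `canopy`, `map_phase` and `reduce_phase` are illustrative, and the map/reduce steps are simulated in-process rather than run on Hadoop.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]
Canopy = Tuple[Vector, List[Vector]]  # (center, members)


def manhattan(a: Vector, b: Vector) -> float:
    """Cheap distance, assumed here in place of the paper's approximate metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy(points: Sequence[Vector], t1: float, t2: float,
           dist: Callable[[Vector, Vector], float] = manhattan) -> List[Canopy]:
    """Build canopies: T1 is the loose threshold, T2 the tight one (T1 > T2)."""
    if t1 <= t2:
        raise ValueError("T1 must be larger than T2")
    remaining = list(points)
    canopies: List[Canopy] = []
    while remaining:
        center = remaining.pop(0)          # next unclaimed point seeds a canopy
        members = [center]
        still_candidates = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)          # within T1: joins this canopy
            if d >= t2:
                still_candidates.append(p) # outside T2: may seed or join others
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


def map_phase(split: Sequence[Vector], t1: float, t2: float) -> List[Vector]:
    """Mapper sketch: run canopy on one input split, emit local centers."""
    return [center for center, _ in canopy(split, t1, t2)]


def reduce_phase(local_centers: Sequence[Sequence[Vector]],
                 t1: float, t2: float) -> List[Vector]:
    """Reducer sketch: run canopy again over all local centers."""
    merged = [c for centers in local_centers for c in centers]
    return [center for center, _ in canopy(merged, t1, t2)]
```

With `t1=3` and `t2=1`, a point at distance 1.5 from a canopy center joins that canopy (it is within T1) but is not removed from the candidate list (it is outside T2), so it later seeds its own canopy as well. This is how the two thresholds produce overlapping subsets.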

Address: 7 Arding Street, Kundulun District, Baotou, Inner Mongolia  Postal code: 014010  Tel: 0472-5951610 or 0472-5953910  Email: cky@imust.edu.cn nkdxb@imust.edu.cn

Copyright: Editorial Office of the Journal of Inner Mongolia University of Science and Technology (©2013)