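Meng Hai-dong, Ji Xiao-qing, Xiao Yin-long, Song Yu-chen
(Inner Mongolia University of Science and Technology, Baotou 014010, Inner Mongolia, China)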
Keywords: classification; cloud computing; MapReduce; random forest; feature selection
CLC number: TP311  Document code: A
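Abstract: In big-data settings, the random forest classification algorithm overlooks minority-class samples and runs inefficiently when classifying imbalanced datasets. To address these problems, a sensitivity-based random forest classification algorithm for the Hadoop environment is proposed. The algorithm borrows methods from feature selection in text classification and is parallelized on the Hadoop cloud computing platform using the MapReduce programming model. Experiments compare the classification results of the proposed algorithm and the traditional random forest classification algorithm on imbalanced data. The results show that the proposed algorithm significantly improves on the performance of the traditional random forest classification algorithm, and that it is efficient and easy to scale.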
Research on Random Forest Classification Algorithm Based on Sensitivity Degree in Hadoop Environment
Meng Hai-dong, Ji Xiao-qing, Xiao Yin-long, Song Yu-chen
(Inner Mongolia University of Science and Technology, Baotou 014010, China)
Key words: classification; cloud computing; MapReduce; Random Forest; feature selection
Abstract: When applied to imbalanced dataset classification in big-data settings, the Random Forest classification algorithm tends to neglect the minority class and to run inefficiently. To address these problems, a Random Forest classification algorithm based on sensitivity degree in the Hadoop environment is proposed. It introduces methods from feature selection in text classification and is parallelized with the MapReduce programming model on the Hadoop cloud computing platform. Experiments compare how this algorithm and the traditional Random Forest classification algorithm classify imbalanced datasets. The experimental results show that the algorithm significantly improves the performance of the traditional Random Forest classification algorithm and that it is efficient and easy to scale.