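GAO Yong-bing1 XIONG Zhen-hua1
(School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010)1
Keywords: professional personal micro-blog; LDA; similarity; event extraction
CLC number: TP399 Document code: A
Abstract: To automatically identify a blogger's professional interests and activities, this paper proposes an LDA-based event extraction algorithm for professional personal micro-blogs. The algorithm extracts feature words with an improved TF-IDF and models the corpus with LDA, mining the relationships between topics and words so that words with larger weights better reflect a micro-blog's topic. It then obtains each micro-blog's probability distribution over the topics and combines it with time similarity to compute an overall similarity between micro-blogs. Finally, the micro-blogs are clustered with an improved K-Means and the clusters are compared against manually labeled data. The experimental results verify the effectiveness of the algorithm and show that it can present the micro-blog events people are interested in in a structured and orderly way.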
Technical Algorithm to Extract Individual Micro-blog Events on the Basis of LDA
GAO Yong-bing1 XIONG Zhen-hua1
(Information Engineering School, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China)1
Keywords: Professional individual micro-blog; LDA; Similarity; Event extraction
Abstract: An LDA-based algorithm is proposed for extracting events from professional individual micro-blogs, in order to identify the interests of professional bloggers automatically. The algorithm uses an improved TF-IDF to pick out key words and models the corpus with LDA to uncover the relationships between topics and words, so that words of greater weight reflect the micro-blog topic more distinctly. This yields the probability distribution of each micro-blog over the topics, which is combined with time similarity to calculate the overall similarity between micro-blogs. Finally, an improved K-Means clustering is applied and its output is compared with manually labeled data, showing hardly any errors. The experimental results indicate that the algorithm is effective and can present micro-blog events of interest in a structured, logical way.
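The abstract combines topic similarity with time similarity but does not give the formula at this point. The following is only a minimal sketch of one way such a combined similarity could be computed. Everything in it is an assumption made for illustration: the cosine measure over topic distributions, the exponential time decay, and the parameters `alpha` and `tau` are hypothetical and are not taken from the paper.

```python
import math


def topic_similarity(p, q):
    """Cosine similarity between two topic-probability vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def time_similarity(t1, t2, tau=24.0):
    """Similarity that decays exponentially with the posting-time gap.

    t1 and t2 are posting times in hours; tau is a hypothetical decay scale.
    """
    return math.exp(-abs(t1 - t2) / tau)


def combined_similarity(p, q, t1, t2, alpha=0.7, tau=24.0):
    """Weighted sum of topic and time similarity; alpha is a hypothetical weight."""
    return alpha * topic_similarity(p, q) + (1 - alpha) * time_similarity(t1, t2, tau)


if __name__ == "__main__":
    # Two micro-blogs with similar topic distributions, posted 6 hours apart.
    blog_a = [0.7, 0.2, 0.1]
    blog_b = [0.6, 0.3, 0.1]
    print(combined_similarity(blog_a, blog_b, t1=0.0, t2=6.0))
```

A pairwise score of this kind could then serve as the similarity used in a clustering step. Under these assumptions, a larger `alpha` makes the clusters depend more on topic and less on posting time.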