3.3.2 HTML Parsing and Text Extraction


It is necessary at this point to clarify why LS extraction cannot be applied directly to raw web pages, which are downloaded in full without any parsing. The first reason is that, unlike plain-text information retrieval, web pages carry HTML tags, which define the page template, font format, font size, image insertion and other components of a polished appearance. These presentational elements, however, become sources of distraction and interference when an application tries to analyze the page, because in general only the visible text of a web page is useful. How to convert an HTML page to pure text by removing all kinds of hidden tags is therefore a key issue: it affects every following step and helps determine the final results. All of the text in the page must be extracted first, but the tag information behind the text cannot simply be discarded. For example, in Michal’s research, she classified and saved the text into 6 different categories, each with its own weight.

The second reason is that link information is also a powerful hint for determining what is distinctive about a particular web page. For example, commercial search engines depend heavily on algorithms such as PageRank and hubs and authorities. Even for search and retrieval studies on academic papers, citation ranking algorithms are widely accepted. However, unlike academic papers, which list their citations in a references chapter at the end, web pages hide their link information in anchor tags, which leads to more complicated data-source preprocessing before LS extraction. Constructing a query from extracted link information, such as the domain the page belongs to, combined with the LS could be another study, but it is not included in this report.
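The preprocessing described above can be illustrated with a minimal sketch. It is not the parser used in this project, only one possible approach using Python's standard `html.parser` module: it keeps the visible text, labels each piece with a category derived from its enclosing tag, and collects anchor links with their domains. The category names in `CATEGORY_OF_TAG`, the class `TextAndLinkExtractor` and the function `extract` are hypothetical; the 6 categories and weights from Michal's research are not listed in this section and are not reproduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical tag-to-category mapping, for illustration only. The actual
# categories and weights of the cited research are not given in this section.
CATEGORY_OF_TAG = {
    "title": "title",
    "h1": "heading", "h2": "heading", "h3": "heading",
    "a": "anchor",
    "b": "emphasis", "strong": "emphasis", "em": "emphasis",
}
SKIPPED_TAGS = {"script", "style"}  # their content is never shown to a reader


class TextAndLinkExtractor(HTMLParser):
    """Collects visible text with a tag-based category, plus anchor hrefs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []   # stack of currently open categorised tags
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.segments = []    # list of (category, text) pairs
        self.links = []       # href values found in anchor tags

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in CATEGORY_OF_TAG:
            self.open_tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags:
            # Pop back to the matching open tag; tolerates sloppy nesting.
            while self.open_tags and self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text or self.skip_depth:
            return
        if self.open_tags:
            category = CATEGORY_OF_TAG[self.open_tags[-1]]
        else:
            category = "plain"
        self.segments.append((category, text))


def extract(html):
    """Return (segments, links, domains) for one HTML document."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    domains = sorted({urlparse(h).netloc for h in parser.links
                      if urlparse(h).netloc})
    return parser.segments, parser.links, domains


if __name__ == "__main__":
    sample = (
        "<html><head><title>Demo page</title>"
        "<style>p { color: red; }</style></head>"
        "<body><h1>Main heading</h1>"
        "<p>Plain text with <b>bold words</b> and a "
        '<a href="http://example.com/paper.html">cited link</a>.</p>'
        "<script>var hidden = 1;</script></body></html>"
    )
    segments, links, domains = extract(sample)
    for category, text in segments:
        print(category, "|", text)
    print(links, domains)
```

Keeping the category next to each text segment, instead of flattening everything into one string, is what would allow a later step to weight, say, title text differently from body text. The list of domains is the kind of link information that the out-of-scope query construction mentioned above could start from.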

posted on 2009-06-18 07:42 by JosephQuinn. Category: My Master-degree Project

