Avenue U


3.3.2 HTML Parsing and Text Extraction

It is necessary at this point to clarify why LS extraction cannot be applied directly to raw web pages that are downloaded in full without any parsing. The first reason is that, unlike pure-text information retrieval, web pages have a unique feature, HTML tags, which define the page template, font format, font size, image insertion and other components that give the page its appearance. However, these good-looking elements are actually sources of distraction and interference when an application tries to analyze the page, because in general only the visible text of a web page is useful. How to convert an HTML page into pure text by removing all kinds of hidden tags is therefore a key issue that affects the following steps and determines the final results. All the text in the page must be extracted first; meanwhile, the tag information behind the text cannot simply be discarded. For example, in Michal’s research, she classified and saved the text into 6 different categories, each with its own weight.

The second reason is that link information is also a powerful hint in determining the unique features of a particular web page. For example, commercial search engines depend largely on algorithms such as PageRank and hubs and authorities. Even in search and retrieval studies of academic papers, citation-ranking algorithms are widely accepted. However, unlike academic papers, which list their citations in a references chapter at the end, web pages hide their link information in anchor tags, which leads to more complicated data-source preprocessing before LS extraction. Constructing a query from extracted link information, such as the domain that the page belongs to, combined with the LS could be another study, but it is not included in this report.
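The preprocessing described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the parser used in this project: the class name `TextAndLinkExtractor`, the helper `extract`, and the tag-to-category mapping are all assumptions made for the example (the six categories from the cited research are not listed here, so the mapping below is purely hypothetical). It keeps the visible text, remembers which kind of tag each piece of text came from, and collects the anchor links separately.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical mapping from tags to text categories (an assumption for this
# sketch; the categories used in the cited research are not given in the text).
CATEGORY_BY_TAG = {
    "title": "title",
    "h1": "heading", "h2": "heading", "h3": "heading",
    "b": "emphasis", "strong": "emphasis",
    "a": "anchor",
}
SKIPPED_TAGS = {"script", "style"}          # content that is never shown as text
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag


class TextAndLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []   # stack of currently open tags
        self.segments = []    # (category, text) pairs in document order
        self.links = []       # href values found in anchor tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())       # normalize whitespace
        if not text or any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        category = "body"                   # default when no mapped tag encloses the text
        for tag in reversed(self.open_tags):
            if tag in CATEGORY_BY_TAG:
                category = CATEGORY_BY_TAG[tag]
                break
        self.segments.append((category, text))


def extract(html):
    """Return (segments, links) for an HTML string."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments, parser.links


def link_domains(links):
    """Return the domains of absolute links, skipping relative ones."""
    return [urlparse(href).netloc for href in links if urlparse(href).netloc]
```

Each category could then be given its own weight before term selection, and `link_domains` shows how the domain of each link might be recovered if link information were to be combined with the LS in a later study.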

posted on 2009-06-18 07:42 by JosephQuinn | Category: My Master-degree Project

Copyright © JosephQuinn