Avenue U


2.5.4 Sentence-Rank on Web Pages

Similar to Xiaojun Wang’s application of word rank to web pages, this project applies sentence rank to web pages. The passages in a web page can be extracted as shown by the red square in Figure 2.19.

Figure 2.19

However, there are conditions under which sentence rank cannot work. Some web pages contain only titles or phrases rather than sentences or passages, which makes sentence rank ineffective on them. Take Figure 2.20 as an example: Yahoo’s home page contains no passages. Although the three red squares in the center hold several sentences, each shown in a separate anchor tag, it is difficult to construct connections among these independent sentences because they come from completely different topics. Meanwhile, the blue squares on the left contain a number of single words and short phrases, such as “answer”, “auto” and “finance”, which makes it hard to combine the terms with the sentences and to apply sentence rank. Therefore, a page like Figure 2.20 is not a typical type of page to which sentence rank can be applied.

Figure 2.20 A typical example of a link-based page

There is a simple way to exclude pages that are not suitable for sentence rank. A threshold p is defined to separate pages into two categories, link-based pages and content-based pages, according to the value computed by formula 2-11.

  2-11

Pages like Figure 2.20 can be classified as link-based pages, in which a high proportion of the text lies inside links. Link-based pages are commonly found among websites’ home pages and index pages. By contrast, a content-based page such as Figure 2.19 has a high proportion of plain text outside links.
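The classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: formula 2-11 is not reproduced here, so the sketch assumes it is the ratio of characters inside anchor tags to all visible text characters. The names `LinkTextCounter`, `link_text_ratio` and `classify_page`, the default threshold `p=0.5`, and the choice of a strict `>` comparison are all hypothetical.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts visible text characters inside and outside <a> tags."""

    SKIPPED = {"script", "style"}  # content that is never shown to the reader

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.skip_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        n = len("".join(data.split()))  # count non-whitespace characters only
        self.total_chars += n
        if self.anchor_depth:
            self.link_chars += n


def link_text_ratio(html):
    """Assumed form of formula 2-11: link text length / total text length."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.link_chars / counter.total_chars


def classify_page(html, p=0.5):
    """Label a page by comparing its link-text ratio with the threshold p."""
    return "link-based" if link_text_ratio(html) > p else "content-based"
```

A page made up almost entirely of anchors, like a portal home page, yields a ratio near 1 and is labelled link-based, while an article page with a few navigation links yields a ratio near 0. The threshold value would need to be tuned on real pages.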

posted on 2009-06-18 03:20 JosephQuinn  Category: My Master-degree Project

Copyright © JosephQuinn