Avenue U


3.4 Result Page Comparison

If there is a URL match or a content match, the retrieval counts as successful. Because URLs change all the time [2][3], a URL mismatch does not by itself mean failure, so the original page has to be compared with the retrieved pages. This comparison is done in two ways, manually and automatically. Manually checking all the content of the original page against the retrieved pages is time-consuming, but it guarantees precise results. In this project, we pick around 200 pages from the data source for manual checking. Automatic comparison between the result pages from the search engine and each test page is not done by brute force; it also requires the HTML page preprocessing described in step 2.

[Equation 3-1]

[Equation 3-2]

In 3-2, TF_w is the term frequency of word w in document 1 or document 2. In this project, the pages are cleaned before comparison, so the comparison between two pages focuses only on the main content: all advertisements, copyright notices, and sponsors' links and information are removed. The task can be summarized as finding a similar topic in two different pages. Three pairs of example pages are shown in Figure3.22 to Figure3.24. Using the undirected weighted sentence rank algorithm, the highest-ranking sentence is picked, submitted as a query to the search engine (SE), and the result page is then compared with the original page.
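As a minimal sketch of the kind of comparison 3-2 describes, assuming it has the standard cosine-similarity form over term-frequency vectors, the calculation could look like the following. The function names, the simple regular-expression tokenizer, and the sample sentences are illustrative assumptions rather than the project's actual code, and the input is assumed to be page text that has already had advertisements and other non-content blocks removed.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase word occurs in already-cleaned text.

    The tokenizer is an assumption: it keeps runs of letters, digits and
    apostrophes and ignores everything else.
    """
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors.

    Returns a value between 0.0 (no shared words) and 1.0 (same word
    proportions). Empty input yields 0.0 to avoid dividing by zero.
    """
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)

    # Only words present in both documents contribute to the dot product.
    shared_words = set(tf_a) & set(tf_b)
    dot = sum(tf_a[w] * tf_b[w] for w in shared_words)

    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical snippets standing in for two cleaned news pages.
    original = "A winter storm brought heavy snow to the East Coast"
    retrieved = "Heavy snow from a winter storm hit the East Coast on Monday"
    print(f"similarity: {cosine_similarity(original, retrieved):.2%}")
```

Because the score depends only on word counts, two versions of the same story that differ in a timestamp or a few edited sentences still score close to 1, which is the property the examples in this section rely on.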

Figure3.22 (a) and (b) is an example that demonstrates the validity of the cosine comparison. The post time is circled in red. Figure3.22 (a) does not show the date but “34 mins ago”, while Figure3.22 (b) shows “Mon Mar2, 11:57pm ET”. Figure3.22 (a) was downloaded in the morning on March 2, 2009 and Figure3.22 (b) was downloaded at noon on the same day. Apparently, Yahoo news editors keep updating and modifying the same story, so the later version differs somewhat in content, but both versions cover the same issue.

Figure3.22 shows images of the downloaded HTML files. The URL of (a) is

http://news.yahoo.com/s/ap/20090302/ap_on_re_us/winter_storm

The retrieved URL is http://news.yahoo.com/s/ap/20090303/ap_on_re_us/winter_storm_43

Comparing the two URLs, it is obvious that even for the same content, Yahoo news changes the URL: the date segment differs (20090302 versus 20090303) and “_43” is appended at the end.

 

Figure3.22: panels (a) and (b)

 

Figure3.23: panels (a) and (b)

Figure3.23 (a) and (b) is an example of finding a web page with similar content, starting from the downloaded local page in Figure3.23 (a). Both pages are about the missing NFL player in Florida’s Gulf, which was one of the most popular news stories at the time of this experiment.

Figure3.23 (a)’s URL is

http://news.yahoo.com/s/ap/20090302/ap_on_re_us/missing_boaters_nfl

Figure3.23 (b)’s URL is

http://www.npr.org/templates/story/story.php?storyId=101375823&ft=1&f=1003

The documents’ similarity is 98.38% by 3-2.

Figure3.24 (a) and (b) is another example of finding a web page with similar content from a downloaded local page. Both pages are about children’s blood lead levels.

Figure3.24 (a)’s URL is

http://news.yahoo.com/s/ap/20090302/ap_on_bi_go_ec_fi/economy

Figure3.24 (b)’s URL is

http://www.ajc.com/services/content/health/stories/2009/03/02/children_lead_level.html?cxtype=rss&cxsvc=7&cxcat=9

The documents’ similarity is 94% by 3-2.

 

Figure3.24: panels (a) and (b)

posted on 2009-06-18 07:52 JosephQuinn, category: My Master-degree Project
