Avenue U

4.5 Random pick sentence

As stated in chapter 3, sentence rank can significantly improve linguistic summarization compared with the traditional TF or DF methods. Sentence rank is costly, however. Randomly picking a sentence and taking its first 3 to 15 words, in their original order, as the search query avoids the iterations of the graph-based ranking algorithm. The results below show that even with randomly picked sentences, the success rate rises sharply as the query grows: it reaches about 75% on Google at 10 terms and exceeds 75% on both engines at 14 and 15 terms, which the carefully designed retrieval algorithms described earlier could not easily achieve.
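The random pick described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's code: the naive sentence splitter, the function names, and the sample text are all hypothetical.

```python
import random
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def random_sentence_query(text, n_words, rng=None):
    """Pick one sentence at random and keep its first n_words words in order."""
    if rng is None:
        rng = random.Random()
    sentences = split_sentences(text)
    if not sentences:
        return ""
    sentence = rng.choice(sentences)
    # Stop words are kept; only the tail of the sentence is dropped.
    return " ".join(sentence.split()[:n_words])


# Hypothetical page text, used only to exercise the functions.
page_text = (
    "A winter storm hit the East Coast on Monday. "
    "Schools were closed in several states. "
    "Forecasters expect more snow later this week."
)

for n in (3, 5, 10):
    print(n, random_sentence_query(page_text, n, random.Random(0)))
```

No ranking iterations are needed here: the cost is one pass to split the text and one random choice, which is the saving the section argues for.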

Random sentence: query length (words)	Google successes	Google success rate	Yahoo successes	Yahoo success rate
3	88.00	39.11%	90.00	40.00%
4	114.00	50.67%	102.00	45.33%
5	134.00	59.56%	124.00	55.11%
6	150.00	66.67%	144.00	64.00%
7	155.00	68.89%	137.00	60.89%
8	162.00	72.00%	154.00	68.44%
9	161.00	71.56%	144.00	64.00%
10	168.00	74.67%	151.00	67.11%
11	168.00	74.67%	151.00	67.11%
12	170.00	75.56%	168.00	74.67%
13	172.00	76.44%	168.00	74.67%
14	171.00	76.00%	169.00	75.11%
15	175.00	77.78%	174.00	77.33%
Average	152.92	67.97%	144.31	64.14%

Table4.24

(a) (b)

Figure4.26 Random Sentence Pick from Google and Yahoo results

posted @ 2009-06-18 12:00 JosephQuinn

4.4 Word Rank


posted @ 2009-06-18 11:44 JosephQuinn

4.3 Google search tips: meta keys and meta description


posted @ 2009-06-18 10:57 JosephQuinn

4.2 Title

The text in the HTML title tag always plays a vital role in web page retrieval. At the beginning of this project, an extensive number of experiments were conducted with the title method. It was believed that using the title text as a query could reach a 90% success rate if the query was composed carefully and properly. Figure4.12 shows that the title method is also fairly stable as the number of words in the query changes. It is important to mention that the better results of the classic methods in Figure4.1 to Figure4.10 only mean that the HTML extraction performs well: it filters out the structural HTML tags and functional scripts, which would otherwise be big distractions when the target page is processed, because the basic retrieval process is designed for pure text without structural tags. For example, if they are not filtered or removed in the pre-processing step, HTML tags such as ‘td’ and ‘tr’ will have very high term frequencies, and function or variable names in JavaScript will have very low document frequencies. With the title method, by contrast, it is much easier to extract only the text between <title> and </title>.
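As an illustration of how little work the title method needs, the following Python sketch pulls the text between <title> and </title> and keeps its first N words as the query. It uses the standard library's html.parser; the class name, helper name, and sample page are assumptions made for this example, not the project's implementation.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect only the text that appears between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def title_query(html, n_words):
    """Return the first n_words words of the page title as a query string."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title.split()[:n_words])


# Hypothetical page: the script and table tags never reach the query.
sample = (
    "<html><head><title>Winter storm  hits East Coast - Yahoo News</title>"
    "<script>var td = 1;</script></head>"
    "<body><table><tr><td>text</td></tr></table></body></html>"
)
print(title_query(sample, 5))
```

Tags such as ‘td’ and ‘tr’ and the script variable never enter the query, so no separate filtering step is needed for this method.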

Title tag: query length (words)	Google successes	Google success rate	Yahoo successes	Yahoo success rate
3	82.00	36.44%	72.00	32.00%
4	91.00	40.44%	86.00	38.22%
5	111.00	49.33%	94.00	41.78%
6	116.00	51.56%	99.00	44.00%
7	116.00	51.56%	102.00	45.33%
8	115.00	51.11%	102.00	45.33%
9	115.00	51.11%	101.00	44.89%
10	115.00	51.11%	101.00	44.89%
11	115.00	51.11%	102.00	45.33%
12	117.00	52.00%	102.00	45.33%
13	118.00	52.44%	103.00	45.78%
14	126.00	56.00%	111.00	49.33%
15	127.00	56.44%	112.00	49.78%
Average	112.62	50.05%	99.00	44.00%

Table4.11

(a) (b)

Figure4.12 Use title terms as search query

posted @ 2009-06-18 08:25 JosephQuinn

4.1 The basics


posted @ 2009-06-18 08:15 JosephQuinn

3.5 Deep Web Search Engine

A practical application of this project is to examine whether the testing approach used on general search engines can also be applied to deep web search engines. General search engines such as Google and Yahoo are widely recognized for returning appropriate results and links. However, many sites do not allow their documents to be indexed and instead make them accessible only through their own search engines; these sites are part of the so-called Deep Web ^[1][17]. Deep web search engines focus only on their own databases: pages, data or documents that are kept private and cannot be searched by general search engines. Take www.taobao.com as an example. It is an online commercial trading site like www.ebay.com. Taobao apparently stopped allowing general search engines such as www.baidu.com and www.google.com to access its commodity listings after negotiations with the big search engine companies broke down. As a result, people who want commodity and price information have to go directly to Taobao’s own search interface and browse the result items on Taobao’s website. Taobao’s search engine was presumably developed by its own team or by contracted software consultants, so its performance is a more open question than that of publicly accepted engines such as Google and Yahoo. A detailed introduction to the deep web and the implementation of deep web search engines is beyond the scope of this project, but the project offers a feasible way to test small, local search engines embedded in their own web sites.

posted @ 2009-06-18 07:53 JosephQuinn

3.4 Result Page Comparison

If there is a URL match or a content match, the retrieval counts as successful. Because URLs change all the time ^[2][3], a URL mismatch does not settle the question, so comparing the original page with the retrieved pages is indispensable. The comparison is done in 2 ways, manually and automatically. Manually checking all the content of the original page against the retrieved pages is time consuming, but it guarantees precise results. In this project, we picked around 200 pages from the data source for manual checking. Automatic comparison between the result pages from the search engine and each test page is not done by brute force; it also needs the HTML page preprocessing described in step 2.

3-1

3-2

In 3-2, TF_w is the word’s term frequency in document 1 or document 2. In this project, some necessary content is removed from the pages, so the comparison between 2 pages focuses only on the main content: all advertisements, copyright information, and sponsors’ links and information are removed. The comparison can be summarized as finding a similar topic in 2 different pages. 3 pairs of example pages are shown in Figure3.22 to Figure3.24. Using the undirected weighted sentence rank algorithm, the highest-ranking sentence is picked, entered as a query into the search engine (SE), and the result page is then compared with the original page.
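A comparison of this kind can be sketched in Python as a cosine similarity over term-frequency vectors. The exact form of 3-2 is not reproduced here, so this assumes the standard definition: the sum over shared words of the product of their term frequencies, divided by the product of the two vectors' lengths. The tokenizer, function names, and sample texts are assumptions for illustration.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens (a simple tokenizer, assumed here)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical main-content snippets after advertisements and links are removed.
original = "Winter storm closes schools across the East Coast"
retrieved = "Schools closed as winter storm moves up the East Coast"
print(round(cosine_similarity(original, retrieved), 4))
```

Because the score depends only on word counts, small edits to a news story, such as an updated timestamp, change it very little, which is why two versions of the same article can still match.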

Figure3.22 (a) and (b) are an example that supports the validity of the cosine comparison. The post time is marked with a red circle. Figure3.22 (a) does not show the date but “34 mins ago”, while Figure3.22 (b) shows “Mon Mar2, 11:57pm ET”. Figure3.22 (a) was downloaded on the morning of March 2, 2009 and Figure3.22 (b) at noon on the same day. Apparently, Yahoo news editors keep updating and modifying the same story, so the later version differs somewhat in content, but both cover the same issue.

Figure3.22 shows images of the downloaded HTML files. The URL of (a) is

http://news.yahoo.com/s/ap/20090302/ap_on_re_us/winter_storm

The retrieved URL is http://news.yahoo.com/s/ap/20090303/ap_on_re_us/winter_storm_43

Comparing the two URLs, it is obvious that even for the same content, Yahoo news changes the URL by adding “_43” at the end.

(a) (b)

Figure3.22

(a) (b)

Figure3.23

Figure3.23 (a) and (b) are an example of finding a web page with similar content, starting from the downloaded local page in Figure3.23 (a). Both cover the missing NFL player in Florida’s Gulf, one of the most popular news stories at the time of this experiment.

Figure3.23 (a)’s URL is

http://news.yahoo.com/s/ap/20090302/ap_on_re_us/missing_boaters_nfl

Figure3.23 (b)’s URL is

http://www.npr.org/templates/story/story.php?storyId=101375823&ft=1&f=1003

The documents’ similarity is 98.38% by 3-2.

Figure3.24 (a) and (b) are another example of finding a web page with similar content from a downloaded local page. Both discuss children’s blood lead levels.

Figure3.24 (a)’s URL is:

http://news.yahoo.com/ /s/ap/20090302/ap_on_bi_go_ec_fi/economy

Figure3.24 (b)’s URL is

http://www.ajc.com/services/content/health/stories/2009/03/02/children_lead_level.html?cxtype=rss&cxsvc=7&cxcat=9

The documents’ similarity is 94% by 3-2.

(a) (b)

Figure3.24

posted @ 2009-06-18 07:52 JosephQuinn

3.3.3 Query Length

As noted in chapter 2, section 2.1, S. T. Park used 5 terms per query. This project adopts a wider range: the LS length varies from 3 to 15 terms, and the success rate is compared across these lengths. With sentence rank, the first N words (N from 3 to 15) of a top-ranked sentence are taken as the search query, stop words included, and the rest of the sentence is discarded. This procedure does not follow the traditional practice in text retrieval; however, the experiments in chapter 5 show that once the query has more than 10 terms, it gives even better results than the traditional methods at the same query length.

posted @ 2009-06-18 07:43 JosephQuinn

3.3.2 HTML Parsing and Text Extraction

It is necessary at this point to clarify why LS extraction cannot be applied directly to raw web pages, which are downloaded in full without any parsing. The first reason is that, unlike pure-text information retrieval, web pages have a unique feature: HTML tags, which build the page template, font format, font size, image insertion and other components of a polished appearance. These good-looking gadgets are actually sources of distraction and interference when applications try to analyze the page, because only the visible text of a web page is generally useful. How to convert an HTML page to pure text by removing all kinds of hidden tags is therefore a key issue for the following steps and determines the final results. All the text in the page must be extracted first; meanwhile, the tag information behind the text cannot simply be discarded. For example, in Michal’s research, she classified and saved the text into 6 different categories, each with its own weight.

The second reason is that link information is also a powerful hint about what makes a particular web page unique. For example, commercial search engines depend largely on algorithms such as PageRank and hubs and authorities. Even in search and retrieval studies of academic papers, citation ranking is widely accepted. However, unlike academic papers, which list their citations in a references chapter at the end, web pages hide their link information in anchor tags, which makes the data-source preprocessing before LS extraction more complicated. Constructing a query from extracted link information, such as the domain the page belongs to, combined with the LS could be a separate study, but it is not included in this report.
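The two preprocessing concerns above, visible text and anchor links, can be illustrated with a small Python sketch built on the standard library's html.parser. It is a simplified assumption of what such a step might look like: it drops script and style content, keeps the remaining text, and records href values from anchor tags. It does not reproduce the 6 weighted text categories mentioned above, and the class name, helper name, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Keep visible text, skip script/style content, and collect anchor hrefs."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script> or <style> is ignored entirely.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (plain_text, links) for an HTML string."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical page used only to exercise the extractor.
page = (
    "<html><head><style>td {color: red;}</style>"
    "<script>var tr = 1;</script></head>"
    "<body><table><tr><td>Storm closes schools</td></tr></table>"
    '<a href="http://news.yahoo.com/">More news</a></body></html>'
)
text, links = extract(page)
print(text)
print(links)
```

Keeping the links in a separate list leaves room for the link-based query construction mentioned above, while the text list feeds the LS extraction.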

posted @ 2009-06-18 07:42 JosephQuinn

3.3.1 Page Quality


posted @ 2009-06-18 07:41 JosephQuinn
