Avenue U


4.1 The basics

4 Experimental Results and Analysis

All of the following experiments were conducted between April 24, 2009 and April 30, 2009. Each term's document frequency was obtained from the Google web search interface on April 24, 2009.

4.1 The basics

The basics cover all the LS extraction methods from Seung Park's paper: TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1. In this project, query length ranges from 3 to 15 terms. The digits in a method name give the ratio between TF (or TFIDF) terms and DF terms in each query: '3' and '2' mean that 60% of the terms are TF terms and 40% are DF terms, while '4' and '1' mean that 80% are TF terms and 20% are DF terms. The details are given in Section 3.1 of Chapter 3. Because the queries here are longer, two further groups that are not in Seung Park's paper are added, TF5DF5 and TFIDF5DF5, in which each query is composed of 50% TF and 50% DF terms, or 50% TFIDF and 50% DF terms. The TF, DF and IDF selections strictly follow the approach in Seung Park's paper.
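The ratio naming can be illustrated with a small Python sketch. It is only an illustration: the function names are hypothetical, the ranked term lists are assumed to be precomputed, and rounding the DF share down is an assumption, since the text above does not say how fractional term counts are handled.

```python
def split_query_length(n_terms, primary_parts, df_parts):
    """Split a query of n_terms into (primary_count, df_count).

    primary_parts:df_parts is the ratio encoded in the method name,
    e.g. 3:2 for TF3DF2 or 4:1 for TFIDF4DF1.
    Assumption: the DF share is rounded down (integer division).
    """
    df_count = n_terms * df_parts // (primary_parts + df_parts)
    return n_terms - df_count, df_count


def compose_query(primary_ranked, df_ranked, n_terms, primary_parts, df_parts):
    """Build a mixed query from two ranked term lists (hypothetical helper)."""
    n_primary, _ = split_query_length(n_terms, primary_parts, df_parts)
    query = list(primary_ranked[:n_primary])
    # Fill the remaining slots with DF-ranked terms, skipping duplicates.
    for term in df_ranked:
        if len(query) == n_terms:
            break
        if term not in query:
            query.append(term)
    return query


if __name__ == "__main__":
    for n in (3, 5, 10, 15):
        print(n, split_query_length(n, 3, 2), split_query_length(n, 4, 1))
```

Under this rounding assumption a 3-term or 4-term TF4DF1 query contains no DF term at all, so the first DF term only appears at 5 terms.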

The following tables and charts show the number of successfully retrieved URLs out of 225 and the corresponding percentage. The 225 URLs are all listed in Appendix B. The 'success count' on the Y axis is defined as follows: a query is sent to a search engine, and if at least one of the first 10 result URLs matches the original URL, the success count is increased by 1.
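The counting rule can be sketched in Python as follows. The function names and the URL normalization (ignoring scheme, a leading "www." and a trailing slash, and comparing case-insensitively) are assumptions made for this sketch; the text does not state how a match between two URLs was decided.

```python
def normalize(url):
    """Assumed normalization: lower-case, drop scheme, 'www.' and trailing slash."""
    url = url.strip().lower()
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")


def is_success(original_url, result_urls, top_k=10):
    """True if any of the first top_k result URLs matches the original URL."""
    target = normalize(original_url)
    return any(normalize(u) == target for u in result_urls[:top_k])


def count_successes(runs):
    """runs: iterable of (original_url, result_urls) pairs, one per test page."""
    return sum(1 for original, results in runs if is_success(original, results))


if __name__ == "__main__":
    runs = [
        ("http://example.com/a/", ["https://example.com/b", "http://www.example.com/a"]),
        ("http://example.com/c", ["http://example.org/c"]),
    ]
    print(count_successes(runs))
```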

In all the following sub-sections, the blue lines represent results from Google web search and the red lines represent results from Yahoo web search. The left charts show the number of successfully retrieved pages among all 225 pages. The right charts show the success rate, which is the number of successfully retrieved pages divided by 225.
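As a quick check of how a count maps to a rate, the short Python sketch below converts success counts into percentages of the 225 test pages. The function name is hypothetical; the sample counts (61 and 153) are the Google TF results for 3 and 9 terms from Table 4.1.

```python
TOTAL_PAGES = 225  # number of test URLs listed in Appendix B


def success_rate(success_count, total_pages=TOTAL_PAGES):
    """Return the success rate as a percentage of all test pages."""
    return 100.0 * success_count / total_pages


if __name__ == "__main__":
    for count in (61, 153):
        print(f"{count} / {TOTAL_PAGES} = {success_rate(count):.2f}%")
```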

TF (words)          Google count   Google rate   Yahoo count   Yahoo rate
3                   61.00          27.11%        58.00         25.78%
4                   91.00          40.44%        86.00         38.22%
5                   113.00         50.22%        98.00         43.56%
6                   122.00         54.22%        110.00        48.89%
7                   137.00         60.89%        128.00        56.89%
8                   144.00         64.00%        127.00        56.44%
9                   153.00         68.00%        126.00        56.00%
10                  153.00         68.00%        128.00        56.89%
11                  154.00         68.44%        127.00        56.44%
12                  157.00         69.78%        132.00        58.67%
13                  160.00         71.11%        131.00        58.22%
14                  159.00         70.67%        126.00        56.00%
15                  158.00         70.22%        130.00        57.78%
Average             135.54         60.24%        115.92        51.52%

Table 4.1 TF

Figure 4.1 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by TF

DF (words)          Google count   Google rate   Yahoo count   Yahoo rate
3                   155.00         68.89%        136.00        60.44%
4                   156.00         69.33%        140.00        62.22%
5                   161.00         71.56%        140.00        62.22%
6                   162.00         72.00%        134.00        59.56%
7                   164.00         72.89%        129.00        57.33%
8                   163.00         72.44%        129.00        57.33%
9                   163.00         72.44%        134.00        59.56%
10                  159.00         70.67%        130.00        57.78%
11                  162.00         72.00%        131.00        58.22%
12                  160.00         71.11%        126.00        56.00%
13                  159.00         70.67%        130.00        57.78%
14                  162.00         72.00%        123.00        54.67%
15                  161.00         71.56%        125.00        55.56%
Average             160.54         71.35%        131.31        58.36%

Table 4.2 DF

Figure 4.2 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by DF

Figure 4.1 shows that the success rate grows with the number of terms in a query and then flattens out after about 10 terms per query. By contrast, Figure 4.2 shows that with DF (document frequency) the success rate does not change significantly with query length: it stays around 70% on Google and around 60% on Yahoo across the whole range from 3 to 15 terms. This suggests that DF terms identify the page well within the results returned by both Google and Yahoo even when the number of terms is as low as 3, 4 or 5, where the success rate is very similar to that for more than 10 terms. As the following experiments show, when the share of DF terms in the query increases, as in TF5DF5 where DF terms make up 50%, the success rate for short queries of 3, 4 or 5 terms is also higher than in TF4DF1, where DF terms make up 20%.
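One simple way to quantify this contrast is the spread between the best and worst query length for each method. The Python sketch below uses the Google counts from Table 4.1 and Table 4.2; the spread measure and the function name are choices made for this illustration and are not part of the original analysis.

```python
TOTAL_PAGES = 225

# Google success counts for query lengths 3..15 (Table 4.1 and Table 4.2).
GOOGLE_COUNTS = {
    "TF": [61, 91, 113, 122, 137, 144, 153, 153, 154, 157, 160, 159, 158],
    "DF": [155, 156, 161, 162, 164, 163, 163, 159, 162, 160, 159, 162, 161],
}


def spread_in_points(counts, total_pages=TOTAL_PAGES):
    """Difference between best and worst success rate, in percentage points."""
    return 100.0 * (max(counts) - min(counts)) / total_pages


if __name__ == "__main__":
    for method, counts in GOOGLE_COUNTS.items():
        print(f"{method}: {spread_in_points(counts):.1f} points")
```

On these numbers TF varies by 44.0 percentage points across query lengths on Google, while DF varies by only 4.0.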

TFIDF (words)       Google count   Google rate   Yahoo count   Yahoo rate
3                   80.00          35.56%        81.00         36.00%
4                   105.00         46.67%        93.00         41.33%
5                   134.00         59.56%        114.00        50.67%
6                   144.00         64.00%        125.00        55.56%
7                   151.00         67.11%        141.00        62.67%
8                   160.00         71.11%        131.00        58.22%
9                   162.00         72.00%        135.00        60.00%
10                  165.00         73.33%        137.00        60.89%
11                  168.00         74.67%        133.00        59.11%
12                  167.00         74.22%        142.00        63.11%
13                  169.00         75.11%        131.00        58.22%
14                  168.00         74.67%        133.00        59.11%
15                  168.00         74.67%        131.00        58.22%
Average             149.31         66.36%        125.15        55.62%

Table 4.3 TFIDF

Figure 4.3 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by TFIDF

PW (words)          Google count   Google rate   Yahoo count   Yahoo rate
3                   60.00          26.67%        61.00         27.11%
4                   92.00          40.89%        81.00         36.00%
5                   115.00         51.11%        94.00         41.78%
6                   121.00         53.78%        108.00        48.00%
7                   137.00         60.89%        123.00        54.67%
8                   144.00         64.00%        124.00        55.11%
9                   153.00         68.00%        126.00        56.00%
10                  154.00         68.44%        126.00        56.00%
11                  154.00         68.44%        121.00        53.78%
12                  158.00         70.22%        121.00        53.78%
13                  162.00         72.00%        130.00        57.78%
14                  163.00         72.44%        122.00        54.22%
15                  158.00         70.22%        124.00        55.11%
Average             136.23         60.55%        112.38        49.95%

Table 4.4 PW

Figure 4.4 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by PW

TF3DF2 (words)      Google count   Google rate   Yahoo count   Yahoo rate
3                   128.00         56.89%        123.00        54.67%
4                   137.00         60.89%        130.00        57.78%
5                   154.00         68.44%        137.00        60.89%
6                   162.00         72.00%        141.00        62.67%
7                   165.00         73.33%        133.00        59.11%
8                   170.00         75.56%        140.00        62.22%
9                   172.00         76.44%        140.00        62.22%
10                  168.00         74.67%        149.00        66.22%
11                  170.00         75.56%        144.00        64.00%
12                  168.00         74.67%        145.00        64.44%
13                  165.00         73.33%        142.00        63.11%
14                  166.00         73.78%        135.00        60.00%
15                  163.00         72.44%        128.00        56.89%
Average             160.62         71.38%        137.46        61.09%

Table 4.5 TF3DF2

Figure 4.5 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by TF3DF2

TF4DF1 (words)      Google count   Google rate   Yahoo count   Yahoo rate
3                   50.00          22.22%        51.00         22.67%
4                   83.00          36.89%        78.00         34.67%
5                   153.00         68.00%        140.00        62.22%
6                   158.00         70.22%        126.00        56.00%
7                   165.00         73.33%        135.00        60.00%
8                   167.00         74.22%        131.00        58.22%
9                   169.00         75.11%        131.00        58.22%
10                  169.00         75.11%        134.00        59.56%
11                  168.00         74.67%        140.00        62.22%
12                  169.00         75.11%        138.00        61.33%
13                  169.00         75.11%        146.00        64.89%
14                  167.00         74.22%        143.00        63.56%
15                  168.00         74.67%        142.00        63.11%
Average             150.38         66.84%        125.77        55.90%

Table 4.6 TF4DF1

Figure 4.6 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by TF4DF1

TF5DF5 (words)      Google count   Google rate   Yahoo count   Yahoo rate
3                   128.00         56.89%        122.00        54.22%
4                   149.00         66.22%        137.00        60.89%
5                   154.00         68.44%        140.00        62.22%
6                   165.00         73.33%        147.00        65.33%
7                   167.00         74.22%        147.00        65.33%
8                   170.00         75.56%        143.00        63.56%
9                   168.00         74.67%        144.00        64.00%
10                  165.00         73.33%        139.00        61.78%
11                  166.00         73.78%        138.00        61.33%
12                  163.00         72.44%        133.00        59.11%
13                  163.00         72.44%        138.00        61.33%
14                  163.00         72.44%        131.00        58.22%
15                  163.00         72.44%        132.00        58.67%
Average             160.31         71.25%        137.77        61.23%

Table 4.7 TF5DF5

Figure 4.7 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by TF5DF5

TFIDF3DF2 (words)   Google count   Google rate   Yahoo count   Yahoo rate
3                   127.00         56.44%        130.00        57.78%
4                   139.00         61.78%        134.00        59.56%
5                   162.00         72.00%        138.00        61.33%
6                   164.00         72.89%        146.00        64.89%
7                   167.00         74.22%        141.00        62.67%
8                   168.00         74.67%        144.00        64.00%
9                   170.00         75.56%        147.00        65.33%
10                  170.00         75.56%        145.00        64.44%
11                  168.00         74.67%        146.00        64.89%
12                  168.00         74.67%        148.00        65.78%
13                  166.00         73.78%        144.00        64.00%
14                  164.00         72.89%        140.00        62.22%
15                  162.00         72.00%        142.00        63.11%
Average             161.15         71.62%        141.92        63.08%

Table 4.8 TFIDF3DF2

Figure 4.8 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by TFIDF3DF2

TFIDF4DF1 (words)   Google count   Google rate   Yahoo count   Yahoo rate
3                   80.00          35.56%        81.00         36.00%
4                   105.00         46.67%        96.00         42.67%
5                   151.00         67.11%        135.00        60.00%
6                   161.00         71.56%        124.00        55.11%
7                   168.00         74.67%        137.00        60.89%
8                   168.00         74.67%        139.00        61.78%
9                   172.00         76.44%        140.00        62.22%
10                  172.00         76.44%        142.00        63.11%
11                  170.00         75.56%        146.00        64.89%
12                  170.00         75.56%        142.00        63.11%
13                  170.00         75.56%        141.00        62.67%
14                  169.00         75.11%        147.00        65.33%
15                  167.00         74.22%        145.00        64.44%
Average             155.62         69.16%        131.92        58.63%

Table 4.9 TFIDF4DF1

Figure 4.9 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by TFIDF4DF1

TFIDF5DF5 (words)   Google count   Google rate   Yahoo count   Yahoo rate
3                   127.00         56.44%        130.00        57.78%
4                   154.00         68.44%        142.00        63.11%
5                   162.00         72.00%        141.00        62.67%
6                   164.00         72.89%        145.00        64.44%
7                   167.00         74.22%        144.00        64.00%
8                   169.00         75.11%        139.00        61.78%
9                   169.00         75.11%        144.00        64.00%
10                  166.00         73.78%        142.00        63.11%
11                  167.00         74.22%        145.00        64.44%
12                  164.00         72.89%        131.00        58.22%
13                  163.00         72.44%        146.00        64.89%
14                  162.00         72.00%        135.00        60.00%
15                  163.00         72.44%        133.00        59.11%
Average             161.31         71.69%        139.77        62.12%

Table 4.10 TFIDF5DF5

Figure 4.10 (a) Successfully retrieved page counts per 225 pages and (b) the corresponding percentage values by TFIDF5DF5

As shown in Figure 4.1 to Figure 4.10, the basic information retrieval methods work well on the 225 online pages. Once the number of query terms exceeds 10, the success rate is around 70% on Google and around 60% on Yahoo, and all of the methods become stable. The average success rate over query lengths from 3 to 15 terms is then computed for each method, and the results are shown in Figure 4.11. For a better comparison, the Title method from Section 4.2 of Chapter 4 is included ahead of its own section. DF, TF3DF2, TF5DF5, TFIDF3DF2 and TFIDF5DF5 have higher success rates than the others: each averages more than 70% on Google and more than 60% on Yahoo, except for DF's Yahoo result (58.36%), which is still higher than that of most of the remaining methods. Again, this shows the importance of DF in retrieval.
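The averaging and comparison step can be reproduced from the tables with a short Python sketch. The data are the "Average" rows of Table 4.1 to Table 4.10 and the Google TF counts from Table 4.1; the function name and the threshold filter are illustrative choices, and the Title method is left out because its figures are not given in this section.

```python
from statistics import mean

# (Google %, Yahoo %) average success rates from Table 4.1 to Table 4.10.
AVERAGE_RATES = {
    "TF": (60.24, 51.52),
    "DF": (71.35, 58.36),
    "TFIDF": (66.36, 55.62),
    "PW": (60.55, 49.95),
    "TF3DF2": (71.38, 61.09),
    "TF4DF1": (66.84, 55.90),
    "TF5DF5": (71.25, 61.23),
    "TFIDF3DF2": (71.62, 63.08),
    "TFIDF4DF1": (69.16, 58.63),
    "TFIDF5DF5": (71.69, 62.12),
}

# Google TF success counts for query lengths 3..15 (Table 4.1).
TF_GOOGLE_COUNTS = [61, 91, 113, 122, 137, 144, 153, 153, 154, 157, 160, 159, 158]


def methods_above(rates, google_min, yahoo_min):
    """Names of methods whose averages exceed both thresholds."""
    return [name for name, (google, yahoo) in rates.items()
            if google > google_min and yahoo > yahoo_min]


if __name__ == "__main__":
    # Average over the 13 query lengths, as in the "Average" table rows.
    print(round(mean(TF_GOOGLE_COUNTS), 2))
    print(methods_above(AVERAGE_RATES, 70.0, 0.0))
    print(methods_above(AVERAGE_RATES, 70.0, 60.0))
```

With a 70% Google threshold alone the filter returns the five DF-related methods named above; adding the 60% Yahoo threshold drops DF and leaves the four mixed methods.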

Figure 4.11 Comparison of all basic TF-, DF- and IDF-related methods

posted on 2009-06-18 08:15 by JosephQuinn. Category: My Master-degree Project
