Dedian  
-- Focused on search engine development
+ Webcrawler
   
    -- study open source code
          purpose: analyze code structure and basic components
          focus on: Nutch (http://lucene.apache.org/nutch/)
                    & HTMLParser (http://htmlparser.sourceforge.net/)
                     & GData(http://code.google.com/apis/gdata/overview.html)

    -- understand PageRank idea
       related articles:
       http://en.wikipedia.org/wiki/PageRank
       http://www.thesitewizard.com/archive/google.shtml
       paper: "PageRank Uncovered" by Chris Ridings and Mike Shishigin
       http://www.rankforsales.com/n-aa/095-seo-may-31-03.html (about Chris Ridings & SEO)
       http://en.wikipedia.org/wiki/Web_crawler (basic idea about crawler)
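
       To make the PageRank idea concrete, here is a minimal sketch of the usual power iteration over a tiny link graph. It is only an illustration of the formula described in the articles above, not code from Nutch or any other project. The class name, the method name, and the three-page graph are made up for the example; 0.85 is the commonly cited damping factor.

```java
import java.util.Arrays;

/** Minimal PageRank power iteration over a small adjacency list (illustrative only). */
class PageRankSketch {
    /**
     * links[i] lists the indexes of the pages that page i links to.
     * d is the damping factor; iterations is the number of passes to run.
     */
    static double[] pageRank(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);               // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - d) / n);     // "random jump" share for every page
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // Dangling page: spread its rank evenly over all pages.
                    for (int j = 0; j < n; j++) {
                        next[j] += d * rank[i] / n;
                    }
                } else {
                    // Each outgoing link passes on an equal share of the page's rank.
                    for (int target : links[i]) {
                        next[target] += d * rank[i] / links[i].length;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical graph: page 0 -> 1 and 2, page 1 -> 2, page 2 -> 0.
        int[][] links = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

       In this toy graph page 2 ends up ranked above page 1, because it receives links from both other pages while page 1 only gets half of page 0's share.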
      
    -- get familiar with the RSS & Atom protocols

    -- sample coding:
       Interface: Scheduler for fetching web links
       Interface: Web page parser/Analyzer --> to deal with XML-based websites (weblogs or news sites, RSS & Atom) --> Parser classes based on SAX parser
       Interface: Retractor/Fetcher --> to get links from a page
       Interface: Collector --> checks whether a URL is a duplicate and saves it in the URL database with a certain data structure
       Interface: InformationProcessor --> PageRank should be one important factor --> (still thinking this through)
       Interface: Policies (Filter) --> will serve Collector and InformationProcessor --> (still thinking this through)
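
       A rough sketch of how some of the interfaces listed above might fit together. All method names and signatures here are assumptions made for illustration, since the plan only names the interfaces. The in-memory HashSet stands in for the URL database, the fetching and PageRank parts are left out, and the SAX handler simply collects every <link> element, including a feed's channel-level link.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical signatures: the plan above only gives the interface names.
interface Scheduler { void schedule(String url); String next(); boolean hasNext(); }
interface PageParser { List<String> extractLinks(String xml); }
interface Policy { boolean accept(String url); }
interface Collector { boolean collect(String url); }

/** FIFO scheduler: URLs come back in the order they were scheduled. */
class FifoScheduler implements Scheduler {
    private final Queue<String> queue = new ArrayDeque<>();
    public void schedule(String url) { queue.add(url); }
    public String next() { return queue.poll(); }
    public boolean hasNext() { return !queue.isEmpty(); }
}

/** Collector backed by a HashSet that stands in for the URL database. */
class InMemoryCollector implements Collector {
    private final Set<String> seen = new HashSet<>();
    private final Policy policy;
    InMemoryCollector(Policy policy) { this.policy = policy; }
    /** Returns true only if the URL passes the policy and was not seen before. */
    public boolean collect(String url) { return policy.accept(url) && seen.add(url); }
}

/** SAX-based parser that pulls <link> values out of an RSS or Atom document. */
class FeedLinkParser implements PageParser {
    public List<String> extractLinks(String xml) {
        List<String> links = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text;   // non-null while inside an RSS-style <link>
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (!qName.equals("link")) return;
                String href = atts.getValue("href");   // Atom style: <link href="..."/>
                if (href != null) links.add(href);
                else text = new StringBuilder();        // RSS style: <link>...</link>
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) text.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("link") && text != null) {
                    links.add(text.toString().trim());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse feed", e);
        }
        return links;
    }
}

class CrawlerSketch {
    public static void main(String[] args) {
        String rss = "<rss><channel>"
                + "<item><link>http://example.com/a</link></item>"
                + "<item><link>http://example.com/a</link></item>"
                + "<item><link>ftp://example.com/b</link></item>"
                + "</channel></rss>";
        Scheduler scheduler = new FifoScheduler();
        Collector collector = new InMemoryCollector(url -> url.startsWith("http://"));
        for (String link : new FeedLinkParser().extractLinks(rss)) {
            if (collector.collect(link)) scheduler.schedule(link);   // skip duplicates and filtered URLs
        }
        while (scheduler.hasNext()) System.out.println(scheduler.next());
    }
}
```

       Putting the Policy check inside the Collector keeps the filter in one place, so a rejected URL never reaches the scheduler or the URL store.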

+ Indexer/Searcher (almost done, based on Lucene)
posted on 2006-05-19 09:40 Dedian