+ Webcrawler
    
    -- study open source code
            purpose: analyze code structure and basic components
            focus on: Nutch (http://lucene.apache.org/nutch/)
                      & HTMLParser (http://htmlparser.sourceforge.net/)
                      & GData (http://code.google.com/apis/gdata/overview.html)
    -- understand the PageRank idea
        related articles:
        http://en.wikipedia.org/wiki/PageRank
        http://www.thesitewizard.com/archive/google.shtml
        paper: "PageRank Uncovered" by Chris Ridings and Mike Shishigin
       http://www.rankforsales.com/n-aa/095-seo-may-31-03.html (about Chris Ridings & SEO)
        http://en.wikipedia.org/wiki/Web_crawler (basic idea about crawler)
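The PageRank idea from the articles above can be sketched as a simple power iteration over a link graph. This is an illustrative toy (the graph, class name, and parameters are made up, and it is not Nutch's implementation):

```java
import java.util.Arrays;

// Minimal PageRank power-iteration sketch.
// links[i] lists the pages that page i links to.
// Note: dangling pages (no outlinks) simply leak rank in this sketch.
public class PageRankSketch {

    public static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                  // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);  // "random surfer" teleport term
            for (int i = 0; i < n; i++) {
                for (int target : links[i]) {
                    // each page shares its rank equally among its outlinks
                    next[target] += damping * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny graph: A -> B, A -> C, B -> C, C -> A
        int[][] links = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

With damping 0.85, page C ends up ranked highest here: it collects rank from both A and B, while B only receives half of A's rank.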
        
    -- familiar with the RSS & Atom protocols
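A minimal sketch of what handling RSS looks like with the JDK's SAX parser: pull the item titles out of an RSS 2.0 feed. The feed content and class name are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Extracts <item><title> values from an RSS 2.0 feed using the JDK's SAX parser.
public class RssTitleHandler extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private boolean inItem = false;
    private boolean inTitle = false;
    private final StringBuilder text = new StringBuilder();

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("item".equals(qName)) inItem = true;
        if (inItem && "title".equals(qName)) { inTitle = true; text.setLength(0); }
    }

    @Override public void characters(char[] ch, int start, int length) {
        if (inTitle) text.append(ch, start, length);
    }

    @Override public void endElement(String uri, String local, String qName) {
        if (inItem && "title".equals(qName)) { titles.add(text.toString()); inTitle = false; }
        if ("item".equals(qName)) inItem = false;
    }

    public static List<String> titlesOf(String rssXml) throws Exception {
        RssTitleHandler handler = new RssTitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<rss version=\"2.0\"><channel><title>Example</title>"
            + "<item><title>First post</title></item>"
            + "<item><title>Second post</title></item>"
            + "</channel></rss>";
        System.out.println(titlesOf(feed));   // [First post, Second post]
    }
}
```

Tracking the inItem flag is what keeps the channel-level &lt;title&gt; from being collected; Atom would need the same pattern with &lt;entry&gt; instead of &lt;item&gt;.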
    -- sample coding:
        Interface: Scheduler --> for scheduling which web links to fetch
        Interface: Web page parser/Analyzer --> to deal with XML-based websites (weblogs or news sites, RSS & Atom) --> parser classes based on a SAX parser
        Interface: Extractor/Fetcher --> to get links from a page
        Interface: Collector --> check whether a URL is duplicated and save it in the URL database with a certain data structure
        Interface: InformationProcessor --> PageRank should be one important factor --> (under consideration)
        Interface: Policies (Filter) --> will serve the Collector and InformationProcessor --> (under consideration)
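A rough first cut of those interfaces might look like the following. The method names are guesses at this stage, and the HashSet-backed Collector is just an in-memory stand-in for the planned URL database:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the crawler interfaces from the plan above.
interface Scheduler {                       // decides which link to fetch next
    void enqueue(String url);
    String nextUrl();                       // null when the frontier is empty
}

interface Fetcher {                         // Extractor/Fetcher: downloads a page
    String fetch(String url) throws Exception;
}

interface PageParser {                      // parser/analyzer for HTML, RSS, Atom
    List<String> extractLinks(String pageContent);
}

interface Collector {                       // deduplicates URLs before scheduling
    boolean addIfNew(String url);           // true if the URL was not seen before
}

// In-memory stand-in for the real URL database.
class HashSetCollector implements Collector {
    private final Set<String> seen = new HashSet<>();

    @Override public boolean addIfNew(String url) {
        return seen.add(url);               // HashSet.add returns false on duplicates
    }

    public static void main(String[] args) {
        Collector c = new HashSetCollector();
        System.out.println(c.addIfNew("http://example.org/"));   // true
        System.out.println(c.addIfNew("http://example.org/"));   // false
    }
}
```

Splitting scheduling, fetching, parsing, and collecting this way keeps each policy (politeness, dedup, filtering) swappable behind its own interface, which matches how Nutch structures its plugins.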
+ Indexer/Searcher (almost done, based on Lucene)
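As a conceptual reminder of what the Lucene-based indexer/searcher provides: an inverted index maps each term to the set of documents containing it. This toy version is not Lucene's API, just the core idea:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: the core data structure behind Lucene-style search.
class TinyIndex {
    // term -> sorted set of document ids containing that term (the "postings list")
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND query: documents containing every query term.
    public List<Integer> search(String query) {
        List<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) result = new ArrayList<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new ArrayList<>() : result;
    }

    public static void main(String[] args) {
        TinyIndex idx = new TinyIndex();
        idx.addDocument(1, "open source web crawler");
        idx.addDocument(2, "web search with Lucene");
        System.out.println(idx.search("web"));           // [1, 2]
        System.out.println(idx.search("web crawler"));   // [1]
    }
}
```

Lucene adds term frequencies, field storage, and on-disk segment files on top of this, which is why building on it rather than from scratch makes sense here.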