Dedian
-- Focused on search engine development
1. Develop a search engine dedicated to weblogs (the main work will be on the web crawler; the indexer and searcher parts have already been done for XML-based information retrieval)

Motivation:
    a. Weblogs have become more and more popular recently
    b. Although weblog search engines such as Technorati and Blogdigger already exist, a lot of work still seems to remain
    c. Weblog feed formats (RSS 2.0 and Atom) are XML-based and more standardized, which is very close to my current work on XML-based information retrieval
    d. Easily extensible to crawling XML-based information websites other than weblogs
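
To illustrate point c, here is a minimal sketch of reading both feed formats into one common shape with Python's standard library. The function name parse_feed and the title/link output fields are my own choices for illustration, and the sketch assumes Atom 1.0 (namespace http://www.w3.org/2005/Atom) and takes only the first link of each Atom entry; real feeds will need more care.

```python
import xml.etree.ElementTree as ET

# Namespace of Atom 1.0; older Atom drafts use a different one (not handled here).
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return a list of {'title', 'link'} dicts from an RSS 2.0 or Atom 1.0 feed."""
    root = ET.fromstring(xml_text)
    entries = []
    if root.tag == "rss":
        # RSS 2.0: <rss><channel><item><title/><link/></item></channel></rss>
        for item in root.iter("item"):
            entries.append({
                "title": (item.findtext("title") or "").strip(),
                "link": (item.findtext("link") or "").strip(),
            })
    elif root.tag == ATOM_NS + "feed":
        # Atom: <feed><entry><title/><link href="..."/></entry></feed>
        for entry in root.iter(ATOM_NS + "entry"):
            link_el = entry.find(ATOM_NS + "link")  # first link only (simplification)
            entries.append({
                "title": (entry.findtext(ATOM_NS + "title") or "").strip(),
                "link": link_el.get("href", "") if link_el is not None else "",
            })
    else:
        raise ValueError("unrecognized feed root: " + root.tag)
    return entries
```

Because both formats reduce to the same small record, the indexer side would not need to know which format a post came from.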
    
HOWTO:
         a. Use GData to feed in XML-based information
or      b. Use open source crawlers + Lucene (a similar idea appears in this article)
or      c. Develop my own simple crawler package and merge it into my Shemy project, a cluster-structured search engine design based on Lucene

         Likely preference: c > a > b (because most open source crawlers are designed to handle much more complex web pages and links, whereas weblog feeds are simpler, so a crawler for them should be lighter)
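
As a rough idea of how light option c could be, here is a minimal sketch in Python of a feed poller: no HTML parsing and no link graph, only a pass over known feed URLs that reports posts not seen before. It is not the Shemy crawler. The names poll_feeds, entry_links and local_name are hypothetical, and fetch is an assumed callable (url -> XML text) passed in so the loop can run without network access.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.w3.org/2005/Atom}'."""
    return tag.rsplit("}", 1)[-1]


def entry_links(xml_text):
    """Collect the first link of each RSS <item> or Atom <entry>."""
    links = []
    for node in ET.fromstring(xml_text).iter():
        if local_name(node.tag) not in ("item", "entry"):
            continue
        for child in node:
            if local_name(child.tag) == "link":
                # Atom keeps the URL in href; RSS keeps it as element text.
                href = child.get("href") or (child.text or "").strip()
                if href:
                    links.append(href)
                break  # keep only the first link of each entry (simplification)
    return links


def poll_feeds(feed_urls, fetch, seen):
    """Fetch each feed once and return entry links not already in `seen`."""
    fresh = []
    for url in feed_urls:
        try:
            links = entry_links(fetch(url))
        except (OSError, ET.ParseError):
            continue  # skip feeds that fail to download or parse
        for link in links:
            if link not in seen:
                seen.add(link)
                fresh.append(link)
    return fresh
```

In real use, fetch could be something like lambda url: urllib.request.urlopen(url).read(), and the returned links would be handed to the indexer; politeness delays, conditional requests and persistence of the seen set are left out of this sketch.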

Requirement/Functionality Analysis : (in progress)

Schedule: (in progress)

2. Explore performance tuning of search to improve the Shemy kernel
posted on 2006-05-17 06:36 Dedian
