Dedian  
-- 关注搜索引擎的开发
日历
<2006年6月>
28293031123
45678910
11121314151617
18192021222324
2526272829301
2345678
统计
  • 随笔 - 82
  • 文章 - 2
  • 评论 - 228
  • 引用 - 0

导航

常用链接

留言簿(8)

随笔分类(45)

随笔档案(82)

文章档案(2)

Java Spaces

搜索

  •  

积分与排名

  • 积分 - 64129
  • 排名 - 816

最新评论

阅读排行榜

评论排行榜

 
Still working on Webcrawler part, the URL collection strategies are under thinking. A URL frontier which stores the list of  activate URLs to be parsed or downloaded will be applied to handle for synchonized I/O operations with URL collection/Inventory, stuck by some issues:

1. Duplicate URL Elimination:
    a. Host name aliases --> DNS Resolver
    b. Omitted port numbers
    c. Alternative paths on the same host
    d. replication across difference host
    e. non-sense links or session IDs embedded in URLs ?
2. Reachable of URL
3. Distributed Storage of URL Inventory and relative synchronization problem
4. Fetch strategies for URL Frontier or Fetchor to get activate links for parsing
5. Scheduler for fetching and updating URL collection: multi-thread or single thread on each pc, when to decide re-parsing a page
7. URL-Seen test: if that page has been parsed and should it re-parse? which should be done before entering URL frontier...
8. Extensibility issues for those modules: Fetcher, Extractor/Filters, Collector...
9. Checkpointing for crawlering interupted: how to resume the crawler job, how to split crawler jobs and distribute to different machines

seems that I need couple days to refine my systen architecture design...
posted on 2006-06-09 08:57 Dedian 阅读(841) 评论(0)  编辑  收藏

只有注册用户登录后才能发表评论。


网站导航:
 
 
Copyright © Dedian Powered by: 博客园 模板提供:沪江博客