﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>语源科技BlogJava-小鱼的空气</title><link>http://www.blogjava.net/rainsf/</link><description>A record of my thoughts</description><language>zh-cn</language><lastBuildDate>Thu, 07 May 2026 10:18:57 GMT</lastBuildDate><pubDate>Thu, 07 May 2026 10:18:57 GMT</pubDate><ttl>60</ttl><item><title>Nutch 0.9 Notes</title><link>http://www.blogjava.net/rainsf/archive/2007/04/27/114022.html</link><dc:creator>小鱼</dc:creator><author>小鱼</author><pubDate>Fri, 27 Apr 2007 03:09:00 GMT</pubDate><guid>http://www.blogjava.net/rainsf/archive/2007/04/27/114022.html</guid><wfw:comment>http://www.blogjava.net/rainsf/comments/114022.html</wfw:comment><comments>http://www.blogjava.net/rainsf/archive/2007/04/27/114022.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/rainsf/comments/commentRss/114022.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/rainsf/services/trackbacks/114022.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I have been following the progress of Lucene and Nutch. Both projects have been moving very fast lately: Lucene has reached 2.1 and Nutch has reached 0.9, with many welcome improvements.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Today I gave Nutch 0.9 a quick try. My notes follow:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>1. Unpack the Nutch package. In the Nutch root directory, create a directory named urls and put in it one or more text files containing URLs, e.g. urlt.txt, with one URL per line. For example:<br>http://www.blogjava.net<br><font color=#000000>http://www.javaeye.com/</font><br><br><br>2. Edit <span style="COLOR: #ff00ff">crawl-urlfilter.txt</span> in the conf directory. The relevant fragment is:<br># accept hosts in MY.DOMAIN.NAME<br># +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/<br>+^http://www.blogjava.net/<br>+^http://www.javaeye.com/<br>+^http://lucene.apache.org/<br><br>3. Edit <span style="COLOR: #ff00ff">nutch-site.xml</span> in the conf directory so that it reads as follows:<br>
<div style="BORDER-RIGHT: #cccccc 1px solid; PADDING-RIGHT: 5px; BORDER-TOP: #cccccc 1px solid; PADDING-LEFT: 4px; FONT-SIZE: 13px; PADDING-BOTTOM: 4px; BORDER-LEFT: #cccccc 1px solid; WIDTH: 98%; WORD-BREAK: break-all; PADDING-TOP: 4px; BORDER-BOTTOM: #cccccc 1px solid; BACKGROUND-COLOR: #eeeeee"><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><span style="COLOR: #0000ff">&lt;?</span><span style="COLOR: #ff00ff">xml&nbsp;version="1.0"</span><span style="COLOR: #0000ff">?&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top></span><span style="COLOR: #0000ff">&lt;?</span><span style="COLOR: #ff00ff">xml-stylesheet&nbsp;type="text/xsl"&nbsp;href="configuration.xsl"</span><span style="COLOR: #0000ff">?&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top></span><span style="COLOR: #008000">&lt;!--</span><span style="COLOR: #008000">&nbsp;Put&nbsp;site-specific&nbsp;property&nbsp;overrides&nbsp;in&nbsp;this&nbsp;file.&nbsp;</span><span style="COLOR: #008000">--&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top></span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">configuration</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img 
src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">http.agent.name</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">Nutch</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">HTTP&nbsp;'User-Agent'&nbsp;request&nbsp;header.&nbsp;MUST&nbsp;NOT&nbsp;be&nbsp;empty&nbsp;-&nbsp;<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;please&nbsp;set&nbsp;this&nbsp;to&nbsp;a&nbsp;single&nbsp;word&nbsp;uniquely&nbsp;related&nbsp;to&nbsp;your&nbsp;organization.<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NOTE:&nbsp;You&nbsp;should&nbsp;also&nbsp;check&nbsp;other&nbsp;related&nbsp;properties:<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" 
align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;http.robots.agents<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;http.agent.description<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;http.agent.url<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;http.agent.email<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;http.agent.version<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and&nbsp;set&nbsp;their&nbsp;values&nbsp;appropriately.<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img 
src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">http.robots.agents</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">Nutch,*</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">The&nbsp;agent&nbsp;strings&nbsp;we'll&nbsp;look&nbsp;for&nbsp;in&nbsp;robots.txt&nbsp;files,<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;comma-separated,&nbsp;in&nbsp;decreasing&nbsp;order&nbsp;of&nbsp;precedence.&nbsp;You&nbsp;should<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;put&nbsp;the&nbsp;value&nbsp;of&nbsp;http.agent.name&nbsp;as&nbsp;the&nbsp;first&nbsp;agent&nbsp;name,&nbsp;and&nbsp;keep&nbsp;the<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;default&nbsp;*&nbsp;at&nbsp;the&nbsp;end&nbsp;of&nbsp;the&nbsp;list.&nbsp;E.g.:&nbsp;BlurflDev,Blurfl,*<br><img 
src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">http.agent.description</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">Nutch&nbsp;Search&nbsp;Engineer</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span 
style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">Further&nbsp;description&nbsp;of&nbsp;our&nbsp;bot-&nbsp;this&nbsp;text&nbsp;is&nbsp;used&nbsp;in<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;the&nbsp;User-Agent&nbsp;header.&nbsp;&nbsp;It&nbsp;appears&nbsp;in&nbsp;parenthesis&nbsp;after&nbsp;the&nbsp;agent&nbsp;name.<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">http.agent.url</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: 
#800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">http://lucene.apache.org/nutch/bot.html</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">A&nbsp;URL&nbsp;to&nbsp;advertise&nbsp;in&nbsp;the&nbsp;User-Agent&nbsp;header.&nbsp;&nbsp;This&nbsp;will&nbsp;<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;appear&nbsp;in&nbsp;parenthesis&nbsp;after&nbsp;the&nbsp;agent&nbsp;name.&nbsp;Custom&nbsp;dictates&nbsp;that&nbsp;this<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;should&nbsp;be&nbsp;a&nbsp;URL&nbsp;of&nbsp;a&nbsp;page&nbsp;explaining&nbsp;the&nbsp;purpose&nbsp;and&nbsp;behavior&nbsp;of&nbsp;this<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;crawler.<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img 
src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">http.agent.email</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">name</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">nutch-agent@lucene.apache.org</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">value</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">An&nbsp;email&nbsp;address&nbsp;to&nbsp;advertise&nbsp;in&nbsp;the&nbsp;HTTP&nbsp;'From'&nbsp;request<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;header&nbsp;and&nbsp;User-Agent&nbsp;header.&nbsp;A&nbsp;good&nbsp;practice&nbsp;is&nbsp;to&nbsp;mangle&nbsp;this<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" 
align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;address&nbsp;(e.g.&nbsp;'info&nbsp;at&nbsp;example&nbsp;dot&nbsp;com')&nbsp;to&nbsp;avoid&nbsp;spamming.<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">description</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">property</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top></span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">configuration</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top></span></div>
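<br>The description fields in the configuration above say that the agent description and URL appear in parentheses after the agent name in the User-Agent header, and that the email address is advertised there as well. Below is a minimal Java sketch of how such a header value could be assembled from these properties. The class name AgentStringSketch, the helper agentString and the exact separators are assumptions made for illustration. The string Nutch actually sends may differ, for example by including http.agent.version, which is not set in this file.<br>

```java
public class AgentStringSketch {
    // Hypothetical helper, not part of Nutch: joins the configured values
    // into a single header value of the form "name (description; url; email)".
    static String agentString(String name, String description, String url, String email) {
        StringBuilder sb = new StringBuilder(name);
        sb.append(" (").append(description)
          .append("; ").append(url)
          .append("; ").append(email)
          .append(")");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values copied from the nutch-site.xml shown above.
        System.out.println(agentString(
            "Nutch",
            "Nutch Search Engineer",
            "http://lucene.apache.org/nutch/bot.html",
            "nutch-agent@lucene.apache.org"));
    }
}
```

<br>Seeing the values combined this way is a quick check that none of them has been left empty before a crawl is started.<br>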
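<br><span style="COLOR: red">Note</span>: nutch-0.9.jar already contains its own nutch-site.xml, and the files in the conf directory are all copied to the root of the classpath. In a web environment, the nutch-site.xml on the classpath is loaded first. When running as a standalone application, pack the nutch-site.xml shown above into nutch-0.9.jar. Otherwise some of the properties above will be empty and Nutch will not run.<br><br><br>4. Running Nutch on Windows is simple: all you need is to be able to execute the Crawl class. Write an Ant script, put it in the Nutch root directory, and run it. Its contents are as follows:<br>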
<div style="BORDER-RIGHT: #cccccc 1px solid; PADDING-RIGHT: 5px; BORDER-TOP: #cccccc 1px solid; PADDING-LEFT: 4px; FONT-SIZE: 13px; PADDING-BOTTOM: 4px; BORDER-LEFT: #cccccc 1px solid; WIDTH: 98%; WORD-BREAK: break-all; PADDING-TOP: 4px; BORDER-BOTTOM: #cccccc 1px solid; BACKGROUND-COLOR: #eeeeee"><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">project&nbsp;</span><span style="COLOR: #ff0000">name</span><span style="COLOR: #0000ff">="nutch-crawl"</span><span style="COLOR: #ff0000">&nbsp;default</span><span style="COLOR: #0000ff">="crawl"</span><span style="COLOR: #ff0000">&nbsp;basedir</span><span style="COLOR: #0000ff">="."</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">property&nbsp;</span><span style="COLOR: #ff0000">name</span><span style="COLOR: #0000ff">="lib.dir"</span><span style="COLOR: #ff0000">&nbsp;&nbsp;location</span><span style="COLOR: #0000ff">="lib"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">property&nbsp;</span><span style="COLOR: #ff0000">name</span><span style="COLOR: #0000ff">="conf.dir"</span><span style="COLOR: #ff0000">&nbsp;&nbsp;location</span><span style="COLOR: #0000ff">="conf"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;<br><img 
src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">path&nbsp;</span><span style="COLOR: #ff0000">id</span><span style="COLOR: #0000ff">="project.classpath"</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">fileset&nbsp;</span><span style="COLOR: #ff0000">dir</span><span style="COLOR: #0000ff">="."</span><span style="COLOR: #ff0000">&nbsp;includes</span><span style="COLOR: #0000ff">="nutch-*.jar"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">fileset&nbsp;</span><span style="COLOR: #ff0000">dir</span><span style="COLOR: #0000ff">="lib"</span><span style="COLOR: #ff0000">&nbsp;</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">pathelement&nbsp;</span><span style="COLOR: #ff0000">path</span><span style="COLOR: #0000ff">="."</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">pathelement&nbsp;</span><span style="COLOR: 
#ff0000">path</span><span style="COLOR: #0000ff">="${conf.dir}"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">path</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">target&nbsp;</span><span style="COLOR: #ff0000">name</span><span style="COLOR: #0000ff">="crawl"</span><span style="COLOR: #ff0000">&nbsp;</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">echo</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">crawling&nbsp;starting...</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">echo</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">property&nbsp;</span><span style="COLOR: #ff0000">name</span><span style="COLOR: #0000ff">="JVM.extra.args"</span><span style="COLOR: #ff0000">&nbsp;value</span><span style="COLOR: #0000ff">="-Xmx512m"</span><span style="COLOR: 
#ff0000">&nbsp;</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">java&nbsp;</span><span style="COLOR: #ff0000">classname</span><span style="COLOR: #0000ff">="org.apache.nutch.crawl.Crawl"</span><span style="COLOR: #ff0000">&nbsp;classpathref</span><span style="COLOR: #0000ff">="project.classpath"</span><span style="COLOR: #ff0000">&nbsp;fork</span><span style="COLOR: #0000ff">="true"</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">jvmarg&nbsp;</span><span style="COLOR: #ff0000">line</span><span style="COLOR: #0000ff">="${JVM.extra.args}"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">arg&nbsp;</span><span style="COLOR: #ff0000">value</span><span style="COLOR: #0000ff">="C:/dev-tools/nutch-0.9/urls"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">arg&nbsp;</span><span style="COLOR: #ff0000">value</span><span style="COLOR: #0000ff">="-dir"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img 
src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">arg&nbsp;</span><span style="COLOR: #ff0000">value</span><span style="COLOR: #0000ff">="C:/dev-tools/nutch-0.9/crawl"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">arg&nbsp;</span><span style="COLOR: #ff0000">value</span><span style="COLOR: #0000ff">="-depth"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">arg&nbsp;</span><span style="COLOR: #ff0000">value</span><span style="COLOR: #0000ff">="3"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">arg&nbsp;</span><span style="COLOR: #ff0000">value</span><span style="COLOR: #0000ff">="-threads"</span><span style="COLOR: #0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">arg&nbsp;</span><span style="COLOR: #ff0000">value</span><span style="COLOR: #0000ff">="15"</span><span style="COLOR: 
#0000ff">/&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">java</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;</span><span style="COLOR: #800000">echo</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000">crawling&nbsp;finished...</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">echo</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">target</span><span style="COLOR: #0000ff">&gt;</span><span style="COLOR: #000000"><br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top>&nbsp;&nbsp;&nbsp;&nbsp;<br><img src="http://www.blogjava.net/Images/OutliningIndicators/None.gif" align=top></span><span style="COLOR: #0000ff">&lt;/</span><span style="COLOR: #800000">project</span><span style="COLOR: #0000ff">&gt;</span></div>
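<br>The classpath in the script above contains both nutch-*.jar and the conf directory, so it may not be obvious which copy of nutch-site.xml is actually picked up (see the note in step 3). Below is a minimal Java sketch for checking this. It uses only the standard ClassLoader.getResource call. The class name WhichNutchSite and the helper describe are made up for illustration and are not part of Nutch.<br>

```java
import java.net.URL;

public class WhichNutchSite {
    // Describes where a resource was found, or reports that it is missing.
    static String describe(String resource, URL location) {
        if (location == null) {
            return resource + " not found on classpath";
        }
        return resource + " loaded from " + location;
    }

    public static void main(String[] args) {
        String resource = "nutch-site.xml";
        // getResource returns the first match in class loader search order,
        // or null when no such resource exists on the classpath.
        URL location = WhichNutchSite.class.getClassLoader().getResource(resource);
        System.out.println(describe(resource, location));
    }
}
```

<br>Run it with the same classpath as the crawl target. A location starting with jar: means the copy inside the jar was picked up, while a file: location pointing into conf means the edited file is the one in use.<br>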
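<br>At this point, if nothing has gone wrong, Nutch should be running happily. When it finishes, you will find what you are looking for in the crawl directory. Enjoy it! 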
<br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/rainsf/" target="_blank">小鱼</a> 2007-04-27 11:09</div>]]></description></item></channel></rss>