﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava - 海水正蓝 - Post Category - Heritrix</title><link>http://www.blogjava.net/xiaohuzi2008/category/53143.html</link><description>Facing the sea, with spring blossoms</description><language>zh-cn</language><lastBuildDate>Thu, 17 Jan 2013 09:19:27 GMT</lastBuildDate><pubDate>Thu, 17 Jan 2013 09:19:27 GMT</pubDate><ttl>60</ttl><item><title>[Repost] Implementing Google's and Baidu's “Did you mean” feature in Java</title><link>http://www.blogjava.net/xiaohuzi2008/archive/2013/01/16/394314.html</link><dc:creator>小胡子</dc:creator><author>小胡子</author><pubDate>Wed, 16 Jan 2013 09:39:00 GMT</pubDate><guid>http://www.blogjava.net/xiaohuzi2008/archive/2013/01/16/394314.html</guid><wfw:comment>http://www.blogjava.net/xiaohuzi2008/comments/394314.html</wfw:comment><comments>http://www.blogjava.net/xiaohuzi2008/archive/2013/01/16/394314.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/xiaohuzi2008/comments/commentRss/394314.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/xiaohuzi2008/services/trackbacks/394314.html</trackback:ping><description><![CDATA[<div><p><strong>Background</strong>:</p> <p>Anyone who has used a search engine or the search box of an e-commerce site has run into this situation: I want to search for the movie &#8220;十二生肖&#8221; but accidentally type the homophone &#8220;十二生效&#8221; instead. There is no need to worry about missing the results you want, because search engines built on large amounts of data correct the query for you automatically. For this example, Google and Baidu returned the following, respectively:</p> <p><strong><span style="color: #ff6600;">Showing results for:</span>&nbsp;<a href="https://www.google.com.hk/search?hl=zh-CN&amp;newwindow=1&amp;safe=strict&amp;tbo=d&amp;biw=1280&amp;bih=901&amp;spell=1&amp;q=%E5%8D%81%E4%BA%8C%E7%94%9F%E8%82%96&amp;sa=X&amp;ei=5FT2UP6pBcmfiAe1x4GYAQ&amp;ved=0CC4QvwUoAA">十二<em>生肖</em></a>&nbsp;and<span style="color: #ff6600;">&nbsp;Did you mean:</span>&nbsp;<a 
href="http://www.baidu.com/s?wd=%E5%8D%81%E4%BA%8C%E7%94%9F%E8%82%96&amp;f=12&amp;rsp=0&amp;oq=%E5%8D%81%E4%BA%8C%E7%94%9F%E6%95%88&amp;tn=baiduhome_pg&amp;ie=utf-8">十二生肖</a>&nbsp;</strong>. Both engines corrected the typo automatically. I wrote <a href="http://www.cnblogs.com/wuren/archive/2012/12/21/2828649.html"><strong>an earlier post</strong></a> about automatic correction, in which I implemented an N-Gram model myself. The results were not very good, mainly because the algorithm's accuracy varies from corpus to corpus, so I wanted to try a different algorithm. The mainstream way to measure the distance between two strings (or, seen the other way around, their similarity) is the <span>Levenshtein distance. When I set out to implement it, I found that Lucene already does this, so let's stand on the shoulders of giants.</span></p> <p><strong>Required JARs:</strong></p> <p>lucene-core-3.1.0.jar +&nbsp;lucene-spellchecker-3.1.0.jar, which you can <a href="http://files.cnblogs.com/wuren/lucene-core-spellchecker.rar"><strong>download here</strong></a>.</p> <p><strong>Usage example:</strong></p> <p><strong>Add the following code to the main method of the SpellCorrector class:</strong></p></div><br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><span style="color: #008000; ">// Imports needed at the top of SpellCorrector.java<br /></span>
<span style="color: #0000FF; ">import</span><span style="color: #000000; "> java.io.File;<br /></span>
<span style="color: #0000FF; ">import</span><span style="color: #000000; "> org.apache.lucene.search.spell.PlainTextDictionary;<br /></span>
<span style="color: #0000FF; ">import</span><span style="color: #000000; "> org.apache.lucene.search.spell.SpellChecker;<br /></span>
<span style="color: #0000FF; ">import</span><span style="color: #000000; "> org.apache.lucene.store.Directory;<br /></span>
<span style="color: #0000FF; ">import</span><span style="color: #000000; "> org.apache.lucene.store.FSDirectory;<br /></span>
<span style="color: #000000; "><br /></span>
<span style="color: #008000; ">// Open the directory that holds the spell-check index.<br /></span>
<span style="color: #008000; ">// The path is left empty in the original; set it to the folder that should hold the index.<br /></span>
<span style="color: #000000; ">File dict = </span><span style="color: #0000FF; ">new</span><span style="color: #000000; "> File("");<br /></span>
<span style="color: #000000; ">Directory directory = FSDirectory.open(dict);<br /></span>
<span style="color: #000000; "><br /></span>
<span style="color: #008000; ">// Instantiate the spell checker<br /></span>
<span style="color: #000000; ">SpellChecker sp = </span><span style="color: #0000FF; ">new</span><span style="color: #000000; "> SpellChecker(directory);<br /></span>
<span style="color: #000000; "><br /></span>
<span style="color: #008000; ">// Locate the dictionary file on the classpath<br /></span>
<span style="color: #000000; ">File dictionary = </span><span style="color: #0000FF; ">new</span><span style="color: #000000; "> File(SpellCorrector.</span><span style="color: #0000FF; ">class</span><span style="color: #000000; ">.getResource("dictionary.txt").getFile());<br /></span>
<span style="color: #000000; "><br /></span>
<span style="color: #008000; ">// Index the dictionary<br /></span>
<span style="color: #000000; ">sp.indexDictionary(</span><span style="color: #0000FF; ">new</span><span style="color: #000000; "> PlainTextDictionary(dictionary));<br /></span>
<span style="color: #000000; "><br /></span>
<span style="color: #008000; ">// A search term that contains a typo<br /></span>
<span style="color: #000000; ">String search = "非常勿扰";<br /></span>
<span style="color: #000000; "><br /></span>
<span style="color: #008000; ">// Number of suggestions. I only want the closest one; you can use another number, such as 3.<br /></span>
<span style="color: #0000FF; ">int</span><span style="color: #000000; "> suggestionNumber = 1;<br /></span>
<span style="color: #000000; "><br /></span>
<span style="color: #008000; ">// Get the suggested terms<br /></span>
<span style="color: #000000; ">String[] suggestions = sp.suggestSimilar(search, suggestionNumber);<br /></span>
<span style="color: #000000; "><br /></span>
<span style="color: #008000; ">// Print the results<br /></span>
<span style="color: #000000; ">System.out.println("Search: " + search);<br /></span>
<span style="color: #0000FF; ">for</span><span style="color: #000000; "> (String word : suggestions) {<br /></span>
<span style="color: #000000; ">&nbsp;&nbsp;&nbsp;&nbsp;System.out.println("Did you mean: " + word);<br /></span>
<span style="color: #000000; ">}</span></div><div><br />Note: you need a corpus before running this. Mine is a file of correct video titles, one title per line, in the following format:</div><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><span style="color: #000000; ">红颜血泪<br />冰上火一般的激情<br />在敌之手<br />驰风竞艇王第二部<br />钓金龟<br />潇湘路一号<br />戏里戏外第二季<br />草原狼爵士乐<br />拯救大兵瑞恩</span></div><br /><div><p>With that in place, just run it. For example, when I search for &#8220;十二生效&#8221;, it asks whether I meant &#8220;十二生肖&#8221;.</p> <p><img src="http://images.cnitblog.com/blog/408040/201301/16172826-cbe2b7c19dca4cab9ad79033987e3c84.jpg" alt="" /></p></div><div><h3>References</h3> <ul><li><a href="http://lucene.apache.org/java/docs/" target="_blank">http://lucene.apache.org/java/docs/</a></li><li><a href="http://today.java.net/pub/a/today/2005/08/09/didyoumean.html" target="_blank">http://today.java.net/pub/a/today/2005/08/09/didyoumean.html</a></li><li><a href="http://archsofty.blogspot.com/2009/12/adicione-o-recurso-voce-quis-dizer-nas.html" target="_blank">http://archsofty.blogspot.com/2009/12/adicione-o-recurso-voce-quis-dizer-nas.html</a></li><li><a href="http://lucene.apache.org/java/3_0_0/api/contrib-spellchecker/index.html" target="_blank">http://lucene.apache.org/java/3_0_0/api/contrib-spellchecker/index.html</a></li><li><a href="http://en.wikipedia.org/wiki/Edit_distance" target="_blank">http://en.wikipedia.org/wiki/Edit_distance</a></li><li><a 
href="http://en.wikipedia.org/wiki/Levenshtein_distance" target="_blank">http://en.wikipedia.org/wiki/Levenshtein_distance</a></li><li><a href="http://en.wikipedia.org/wiki/Jaro-Winkler_distance" target="_blank">http://en.wikipedia.org/wiki/Jaro-Winkler_distance</a></li></ul></div>Original source:<div>http://www.cnblogs.com/wuren/archive/2013/01/16/2862873.html</div><br /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/xiaohuzi2008/" target="_blank">小胡子</a> 2013-01-16 17:39</div>]]></description></item><item><title>[Repost] Heritrix: IP Binding, Startup Parameters, and Garbled Chinese Text</title><link>http://www.blogjava.net/xiaohuzi2008/archive/2012/12/21/393290.html</link><dc:creator>小胡子</dc:creator><author>小胡子</author><pubDate>Fri, 21 Dec 2012 04:34:00 GMT</pubDate><guid>http://www.blogjava.net/xiaohuzi2008/archive/2012/12/21/393290.html</guid><wfw:comment>http://www.blogjava.net/xiaohuzi2008/comments/393290.html</wfw:comment><comments>http://www.blogjava.net/xiaohuzi2008/archive/2012/12/21/393290.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/xiaohuzi2008/comments/commentRss/393290.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/xiaohuzi2008/services/trackbacks/393290.html</trackback:ping><description><![CDATA[<div><div> <h1>Heritrix Resources</h1> <p>Chinese-language resources on Heritrix are scarce online, so here is a short collection:</p> <p>&nbsp;</p> <p>Chinese:</p>
<ul><li>Blog of 邱哲 and 符滔滔, authors of the book 《开发自己的搜索引擎 Lucene 2.0 + Heritrix》 (<em>Developing Your Own Search Engine: Lucene 2.0 + Heritrix</em>)<br /><a href="http://lucenebook.spaces.live.com/">http://lucenebook.spaces.live.com/</a></li>
<li>Sample chapter from the same book: Chapter 10, &#8220;Extending Heritrix&#8221; (quite useful if you are considering custom development)<br /><a href="http://book.csdn.net/bookfiles/312/10031212848.shtml">http://book.csdn.net/bookfiles/312/10031212848.shtml</a></li>
<li>Heritrix notes<br /><a href="http://wiki.hoodong.com/wiki/jRwNBCFgWA1dYB0NC">http://wiki.hoodong.com/wiki/jRwNBCFgWA1dYB0NC</a></li>
<li>Heritrix crawler vs Nutch crawler<br /><a href="http://www.dbanotes.net/web/heritrix_crawler_vs_nutch_crawler.html">http://www.dbanotes.net/web/heritrix_crawler_vs_nutch_crawler.html</a></li>
<li>天下维客 (allwiki): crawler programs<br /><a href="http://www.allwiki.com/wiki/Heritrix#Heritrix.E7.9A.84.E5.B1.80.E9.99.90">http://www.allwiki.com/wiki/Heritrix#Heritrix.E7.9A.84.E5.B1.80.E9.99.90</a></li></ul>
<p>English:</p>
<ul><li>Heritrix home page<br /><a href="http://crawler.archive.org/">http://crawler.archive.org/</a></li>
<li>HTMLParser home page<br /><a href="http://htmlparser.sourceforge.net/">http://htmlparser.sourceforge.net/</a></li></ul></div>  <div>&nbsp;</div> <div>&nbsp;</div> <div> <h1>Heritrix<span style="font-family: 宋体;'Times New Roman';'Times New 
Roman'">: Binding the Host </span>IP</h1> <p>Keywords: Heritrix, 127.0.0.1, IP, host</p> <p>&nbsp;</p> <p style=" text-indent: 21pt">By default, Heritrix binds to the IP address 127.0.0.1.</p> <p style=" text-indent: 21pt">The relevant code is in org.archive.crawler.Heritrix:</p> <p style=" text-indent: 21pt">&nbsp;</p> <table style="border-right: medium none; border-top: medium none; border-left: medium none; border-bottom: medium none; border-collapse: collapse;" border="1" cellpadding="0" cellspacing="0"> <tbody> <tr> <td style="border-right: windowtext 1pt solid; padding-right: 5.4pt; border-top: windowtext 1pt solid; padding-left: 5.4pt; padding-bottom: 0cm; border-left: windowtext 1pt solid; width: 426.1pt; padding-top: 0cm; border-bottom: windowtext 1pt solid; background-color: transparent;" valign="top" width="568"> <p><span style="color: green;">&#8230;</span></p>
<p><span style="color: green;">final private static Collection&lt;String&gt; LOCALHOST_ONLY =</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;Collections.unmodifiableList(Arrays.asList(new String[] { "127.0.0.1" }));</span></p>
<p><span style="color: green;">&#8230;</span></p>
<p><span style="color: green;">private static Collection&lt;String&gt; guiHosts = LOCALHOST_ONLY;</span></p>
<p>&nbsp;</p>
<p><span style="color: green;">protected static String doCmdLineArgs(final String [] args)</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;throws Exception {</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&#8230;</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;// Now look at options passed.</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;for (int i = 0; i &lt; options.length; i++) {</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;switch(options[i].getId()) {</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#8230;</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;case 'b':</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Heritrix.guiHosts = parseHosts(options[i].getValue());</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;break;</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#8230;</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;default:</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;assert false: options[i].getId();</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;}</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;&#8230;</span></p>
<p><span style="color: green;">}</span></p></td></tr></tbody></table> <p style=" text-indent: 21pt">&nbsp;</p> <p style=" text-indent: 21pt">The code first defines the default IP, 127.0.0.1, and assigns it to the guiHosts host variable. Only when the -b or --bind option is given does the specified IP replace that default.</p> <p style=" text-indent: 21pt">In addition, an intermediate argument-processing step converts each --xxxx option to its -x form so that both are handled uniformly, which is why --bind and -b have the same effect.</p></div> <div>&nbsp;</div> <div> <h1>Heritrix Startup Parameters</h1> <p>Keywords: Heritrix, startup, parameters, bind, admin, properties</p> <p>&nbsp;</p> <p style=" text-indent: 21pt">Every Heritrix startup parameter except --bind can be set in heritrix.properties, so you do not have to type it on the command line each time.</p> <p style=" text-indent: 21pt">Examples include the commonly used --port and --admin.</p> <p style=" text-indent: 21pt">&nbsp;</p> <table style="border-right: medium none; border-top: medium none; border-left: medium none; border-bottom: medium none; border-collapse: collapse;" border="1" cellpadding="0" cellspacing="0"> <tbody> <tr> <td style="border-right: windowtext 1pt solid; padding-right: 5.4pt; border-top: windowtext 1pt solid; padding-left: 5.4pt; padding-bottom: 0cm; border-left: windowtext 1pt solid; width: 426.1pt; padding-top: 0cm; border-bottom: windowtext 1pt solid; background-color: transparent;" valign="top" width="568"> <p><span style="color: green">heritrix.cmdline.admin = admin:admin</span></p>
<p><span style="color: green">heritrix.cmdline.port = 8080</span></p>
<p><span style="color: green">heritrix.cmdline.run = false</span></p>
<p><span style="color: green">heritrix.cmdline.nowui = false</span></p>
<p><span style="color: green">heritrix.cmdline.order =</span></p>
<p><span style="color: green">heritrix.cmdline.jmxserver = false</span></p>
<p><span style="color: green">heritrix.cmdline.jmxserver.port = 8081</span></p></td></tr></tbody></table> <p>&nbsp;</p> <h1>Garbled Chinese Text in a Heritrix Extractor</h1> <p>Keywords: Heritrix, Chinese, garbled text, GB2312, Extractor</p> <p>&nbsp;</p> <p style=" text-indent: 21pt">A subclass of org.archive.crawler.extractor.Extractor can, in its extract method, take the content to be parsed from the CrawlURI parameter:</p> <p style=" text-indent: 21pt">&nbsp;</p> <table style="border-right: medium none; border-top: medium none; border-left: medium none; border-bottom: medium none; border-collapse: collapse;" border="1" cellpadding="0" cellspacing="0"> <tbody> <tr> <td style="border-right: windowtext 1pt solid; padding-right: 5.4pt; border-top: windowtext 1pt solid; padding-left: 5.4pt; padding-bottom: 0cm; border-left: windowtext 1pt solid; width: 426.1pt; padding-top: 0cm; border-bottom: windowtext 1pt solid; background-color: transparent;" valign="top" width="568"> <p><span style="color: green;">curi.getHttpRecorder().getReplayCharSequence().toString()</span></p></td></tr></tbody></table> <p style=" text-indent: 21pt">&nbsp;</p> <p style=" text-indent: 21pt">If the content contains Chinese, the output is garbled unless the encoding is handled. You can set the character encoding after obtaining the HttpRecorder:</p> <p style=" text-indent: 21pt">&nbsp;</p> <table style="border-right: medium none; border-top: medium none; border-left: medium none; border-bottom: medium none; border-collapse: collapse;" border="1" cellpadding="0" cellspacing="0"> <tbody> <tr> <td style="border-right: windowtext 1pt solid; padding-right: 5.4pt; border-top: windowtext 1pt solid; padding-left: 5.4pt; padding-bottom: 0cm; border-left: windowtext 1pt solid; width: 426.1pt; padding-top: 0cm; border-bottom: windowtext 1pt solid; background-color: transparent;" valign="top" width="568"> <p><span style="color: green;">HttpRecorder hr = curi.getHttpRecorder();</span></p>
<p><span style="color: green;">if ( hr == null ) {</span></p>
<p><span style="color: green;">&nbsp;&nbsp;&nbsp;&nbsp;throw new IOException( "Why is recorder null here?" );</span></p>
<p><span style="color: green;">}</span></p>
<p><span style="color: green;">hr.setCharacterEncoding( "gb2312" );</span></p>
<p><span style="color: green;">cs = hr.getReplayCharSequence();</span></p>
<p><span style="color: green;">System.out.println( cs.toString() );</span></p></td></tr></tbody></table> <p>&nbsp;</p> <p>Original source:</p><div>http://blog.chinaunix.net/uid-8464637-id-2461166.html</div><br /></div></div><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/xiaohuzi2008/" target="_blank">小胡子</a> 2012-12-21 12:34</div>]]></description></item></channel></rss>