﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-willpower88-随笔分类-搜索</title><link>http://www.blogjava.net/willpower88/category/37510.html</link><description>对JAVA有点理解了……</description><language>zh-cn</language><lastBuildDate>Mon, 09 Feb 2009 08:36:13 GMT</lastBuildDate><pubDate>Mon, 09 Feb 2009 08:36:13 GMT</pubDate><ttl>60</ttl><item><title>lucene2.0+heritrix示例补充</title><link>http://www.blogjava.net/willpower88/archive/2009/02/09/253914.html</link><dc:creator>一凡</dc:creator><author>一凡</author><pubDate>Mon, 09 Feb 2009 07:44:00 GMT</pubDate><guid>http://www.blogjava.net/willpower88/archive/2009/02/09/253914.html</guid><wfw:comment>http://www.blogjava.net/willpower88/comments/253914.html</wfw:comment><comments>http://www.blogjava.net/willpower88/archive/2009/02/09/253914.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/willpower88/comments/commentRss/253914.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/willpower88/services/trackbacks/253914.html</trackback:ping><description><![CDATA[　　由于lucene2.0+heritrix一书示例用的网站(http://mobile.pconline.com.cn/,http:
//mobile.163.com/)改版了，书上实例不能运行，我又找了一个http://mobile.younet.com/进行开发并成功实现示
例，希望感兴趣的同学，近快实践，如果此网站也改了就又得改extractor了，哈哈！
<br />
search的Extractor代码如下，（别和书上实例相同）供大家参考：附件里有完整代码
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #0000ff;">package</span><span style="color: #000000;">&nbsp;com.luceneheritrixbook.extractor.younet;<br />
<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;java.io.BufferedWriter;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;java.io.File;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;java.io.FileWriter;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;java.io.IOException;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;java.util.Date;<br />
<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.Node;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.NodeFilter;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.Parser;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.filters.AndFilter;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.filters.HasAttributeFilter;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.filters.HasChildFilter;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.filters.TagNameFilter;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.tags.ImageTag;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.util.NodeIterator;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;org.htmlparser.util.NodeList;<br />
<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;com.luceneheritrixbook.extractor.Extractor;<br />
</span><span style="color: #0000ff;">import</span><span style="color: #000000;">&nbsp;com.luceneheritrixbook.util.StringUtils;<br />
<br />
</span><span style="color: #008000;">/**</span><span style="color: #008000;"><br />
&nbsp;*&nbsp;&lt;p&gt;&lt;/p&gt;<br />
&nbsp;*&nbsp;</span><span style="color: #808080;">@author</span><span style="color: #008000;">&nbsp;cnyqiao@hotmail.com<br />
&nbsp;*&nbsp;@date&nbsp;&nbsp;&nbsp;Feb&nbsp;6,&nbsp;2009&nbsp;<br />
&nbsp;</span><span style="color: #008000;">*/</span><span style="color: #000000;"><br />
<br />
</span><span style="color: #0000ff;">public</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">class</span><span style="color: #000000;">&nbsp;ExtractYounetMoblie&nbsp;</span><span style="color: #0000ff;">extends</span><span style="color: #000000;">&nbsp;Extractor&nbsp;{<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;@Override<br />
&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">public</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">void</span><span style="color: #000000;">&nbsp;extract()&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;BufferedWriter&nbsp;bw&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">null</span><span style="color: #000000;">;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NodeFilter&nbsp;title_filter&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;AndFilter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;TagNameFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">div</span><span style="color: #000000;">"</span><span style="color: #000000;">),&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;HasAttributeFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">class</span><span style="color: #000000;">"</span><span style="color: #000000;">,&nbsp;</span><span style="color: #000000;">"</span><span style="color: #000000;">mo_tit</span><span style="color: #000000;">"</span><span style="color: #000000;">));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NodeFilter&nbsp;attribute_filter&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;AndFilter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;TagNameFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">p</span><span style="color: #000000;">"</span><span style="color: #000000;">),&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;HasChildFilter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;AndFilter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;TagNameFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">span</span><span style="color: #000000;">"</span><span style="color: #000000;">),&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;HasAttributeFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">class</span><span style="color: #000000;">"</span><span style="color: #000000;">,&nbsp;</span><span style="color: #000000;">"</span><span style="color: #000000;">gn_sp1&nbsp;blue1</span><span style="color: #000000;">"</span><span style="color: #000000;">))));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NodeFilter&nbsp;img_filter&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;AndFilter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;TagNameFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">span</span><span style="color: #000000;">"</span><span style="color: #000000;">),&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;HasChildFilter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;TagNameFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">img</span><span style="color: #000000;">"</span><span style="color: #000000;">)));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">提取标题信息</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">try</span><span style="color: #000000;">&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">Parser根据过滤器返回所有满足过滤条件的节点<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">&nbsp;迭代逐渐查找</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NodeList&nbsp;nodeList</span><span style="color: #000000;">=</span><span style="color: #0000ff;">this</span><span style="color: #000000;">.getParser().parse(title_filter);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NodeIterator&nbsp;it&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;nodeList.elements();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;StringBuffer&nbsp;title&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;StringBuffer();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">while</span><span style="color: #000000;">&nbsp;(it.hasMoreNodes())&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Node&nbsp;node&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;(Node)&nbsp;it.nextNode();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String[]&nbsp;names&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;node.toPlainTextString().split(</span><span style="color: #000000;">"</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">"</span><span style="color: #000000;">);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">for</span><span style="color: #000000;">(</span><span style="color: #0000ff;">int</span><span style="color: #000000;">&nbsp;i&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">0</span><span style="color: #000000;">;&nbsp;i&nbsp;</span><span style="color: #000000;">&lt;</span><span style="color: #000000;">&nbsp;names.length;&nbsp;i</span><span style="color: #000000;">++</span><span style="color: #000000;">)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;title.append(names[i]).append(</span><span style="color: #000000;">"</span><span style="color: #000000;">-</span><span style="color: #000000;">"</span><span style="color: #000000;">);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;title.append(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;Date().getTime());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">创建要生成的文件</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bw&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;BufferedWriter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;FileWriter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;File(</span><span style="color: #0000ff;">this</span><span style="color: #000000;">.getOutputPath()&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;title&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">"</span><span style="color: #000000;">.txt</span><span style="color: #000000;">"</span><span style="color: #000000;">)));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">获取当前提取页的完整URL地址</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">int</span><span style="color: #000000;">&nbsp;startPos&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">this</span><span style="color: #000000;">.getInuputFilePath().indexOf(</span><span style="color: #000000;">"</span><span style="color: #000000;">mirror</span><span style="color: #000000;">"</span><span style="color: #000000;">)&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">6</span><span style="color: #000000;">;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;url_seg&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">this</span><span style="color: #000000;">.getInuputFilePath().substring(startPos);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;url_seg&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;url_seg.replaceAll(</span><span style="color: #000000;">"</span><span style="color: #000000;">\\\\</span><span style="color: #000000;">"</span><span style="color: #000000;">,&nbsp;</span><span style="color: #000000;">"</span><span style="color: #000000;">/</span><span style="color: #000000;">"</span><span style="color: #000000;">);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;url&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">"</span><span style="color: #000000;">http:/</span><span style="color: #000000;">"</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;url_seg;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">写入当前提取页的完整URL地址</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bw.write(url&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;NEWLINE);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bw.write(names[</span><span style="color: #000000;">0</span><span style="color: #000000;">]&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;NEWLINE);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bw.write(names[</span><span style="color: #000000;">1</span><span style="color: #000000;">]&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;NEWLINE);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">&nbsp;重置Parser</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">this</span><span style="color: #000000;">.getParser().reset();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Parser&nbsp;attNameParser&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">null</span><span style="color: #000000;">;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Parser&nbsp;attValueParser&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">null</span><span style="color: #000000;">;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">Parser&nbsp;parser=new&nbsp;Parser("</span><span style="color: #008000; text-decoration: underline;">http://www.sina.com.cn</span><span style="color: #008000;">");</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NodeFilter&nbsp;attributeName_filter&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;AndFilter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;TagNameFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">span</span><span style="color: #000000;">"</span><span style="color: #000000;">),&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;HasAttributeFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">class</span><span style="color: #000000;">"</span><span style="color: #000000;">,&nbsp;</span><span style="color: #000000;">"</span><span style="color: #000000;">gn_sp1&nbsp;blue1</span><span style="color: #000000;">"</span><span style="color: #000000;">));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NodeFilter&nbsp;attributeValue_filter&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;AndFilter(</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;TagNameFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">span</span><span style="color: #000000;">"</span><span style="color: #000000;">),&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;HasAttributeFilter(</span><span style="color: #000000;">"</span><span style="color: #000000;">class</span><span style="color: #000000;">"</span><span style="color: #000000;">,&nbsp;</span><span style="color: #000000;">"</span><span style="color: #000000;">gn_sp2</span><span style="color: #000000;">"</span><span style="color: #000000;">));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;attName&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">""</span><span style="color: #000000;">;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;attValue&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">""</span><span style="color: #000000;">;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">&nbsp;迭代逐渐查找</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;nodeList</span><span style="color: #000000;">=</span><span style="color: #0000ff;">this</span><span style="color: #000000;">.getParser().parse(attribute_filter);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;it&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;nodeList.elements();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">while</span><span style="color: #000000;">&nbsp;(it.hasMoreNodes())&nbsp;{&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Node&nbsp;node&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;(Node)&nbsp;it.nextNode();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;attNameParser&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;Parser();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;attNameParser.setEncoding(</span><span style="color: #000000;">"</span><span style="color: #000000;">GB2312</span><span style="color: #000000;">"</span><span style="color: #000000;">);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;attNameParser.setInputHTML(node.toHtml());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NodeList&nbsp;attNameNodeList&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;attNameParser.parse(attributeName_filter);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;attName&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;attNameNodeList.elements().nextNode().toPlainTextString();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;attValueParser&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">new</span><span style="color: #000000;">&nbsp;Parser();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;attValueParser.setEncoding(</span><span style="color: #000000;">"</span><span style="color: #000000;">GB2312</span><span style="color: #000000;">"</span><span style="color: #000000;">);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;attValueParser.setInputHTML(node.toHtml());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;NodeList&nbsp;attValueNodeList&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;attValueParser.parse(attributeValue_filter);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;attValue&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;attValueNodeList.elements().nextNode().toPlainTextString();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bw.write(attName.trim()&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;attValue.trim());<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bw.newLine();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">&nbsp;重置Parser</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">this</span><span style="color: #000000;">.getParser().reset();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;imgUrl&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">""</span><span style="color: #000000;">;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;fileType&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">""</span><span style="color: #000000;">;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">&nbsp;迭代逐渐查找</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;nodeList</span><span style="color: #000000;">=</span><span style="color: #0000ff;">this</span><span style="color: #000000;">.getParser().parse(img_filter);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;it&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;nodeList.elements();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">while</span><span style="color: #000000;">&nbsp;(it.hasMoreNodes())&nbsp;{&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Node&nbsp;node&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;(Node)&nbsp;it.nextNode();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ImageTag&nbsp;imgNode&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;(ImageTag)node.getChildren().elements().nextNode();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;imgUrl&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;imgNode.getAttribute(</span><span style="color: #000000;">"</span><span style="color: #000000;">src</span><span style="color: #000000;">"</span><span style="color: #000000;">);&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fileType&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;imgUrl.trim().substring(imgUrl<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.lastIndexOf(</span><span style="color: #000000;">"</span><span style="color: #000000;">.</span><span style="color: #000000;">"</span><span style="color: #000000;">)&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">1</span><span style="color: #000000;">);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">生成新的图片的文件名</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;new_iamge_file&nbsp;</span><span style="color: #000000;">=</span><span style="color: #000000;">&nbsp;StringUtils.encodePassword(imgUrl,&nbsp;HASH_ALGORITHM)&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">"</span><span style="color: #000000;">.</span><span style="color: #000000;">"</span><span style="color: #000000;">&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;fileType;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">imgUrl&nbsp;=&nbsp;new&nbsp;HtmlPaserFilterTest().replace(new_iamge_file,&nbsp;"+",&nbsp;"&nbsp;");<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000;">//</span><span style="color: #008000;">利用miorr目录下的图片生成的新的图片</span><span style="color: #008000;"><br />
</span><span style="color: #000000;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">this</span><span style="color: #000000;">.copyImage(imgUrl,&nbsp;new_iamge_file);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bw.write(SEPARATOR&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;NEWLINE);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bw.write(new_iamge_file&nbsp;</span><span style="color: #000000;">+</span><span style="color: #000000;">&nbsp;NEWLINE);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;</span><span style="color: #0000ff;">catch</span><span style="color: #000000;">(Exception&nbsp;e)&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;e.printStackTrace();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;</span><span style="color: #0000ff;">finally</span><span style="color: #000000;">&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">try</span><span style="color: #000000;">{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff;">if</span><span style="color: #000000;">&nbsp;(bw&nbsp;</span><span style="color: #000000;">!=</span><span style="color: #000000;">&nbsp;</span><span style="color: #0000ff;">null</span><span style="color: #000000;">)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bw.close();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span><span style="color: #0000ff;">catch</span><span style="color: #000000;">(IOException&nbsp;e){<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;e.printStackTrace();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
}<br />
</span></div>
运行书上的heritrix实例，并按书上的默认设置进行抓取如下ＵＲＩ：（请自己分析整理）<br />
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><!--<br />
<br />
Code highlighting produced by Actipro CodeHighlighter (freeware)<br />
http://www.CodeHighlighter.com/<br />
<br />
--><span style="color: #000000;">http://mobile.younet.com/files/list_1.html<br />
http://mobile.younet.com/files/list_2.html<br />
http://mobile.younet.com/files/list_3.html<br />
<img src="http://www.blogjava.net/Images/dot.gif"  alt="" /></span></div>
<br />
<img src ="http://www.blogjava.net/willpower88/aggbug/253914.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/willpower88/" target="_blank">一凡</a> 2009-02-09 15:44 <a href="http://www.blogjava.net/willpower88/archive/2009/02/09/253914.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>