﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-Java天空 任我翱翔-随笔分类-Lucene,Nutch,Hadoop</title><link>http://www.blogjava.net/persister/category/38140.html</link><description /><language>zh-cn</language><lastBuildDate>Thu, 16 Sep 2010 17:00:17 GMT</lastBuildDate><pubDate>Thu, 16 Sep 2010 17:00:17 GMT</pubDate><ttl>60</ttl><item><title>Hadoop学习笔记（一）</title><link>http://www.blogjava.net/persister/archive/2010/03/12/315306.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Fri, 12 Mar 2010 12:59:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2010/03/12/315306.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/315306.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2010/03/12/315306.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/315306.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/315306.html</trackback:ping><description><![CDATA[今天将Hadoop下载下来学习了一下文档中的tutorial，然后仿照如下链接实现了一个word count的例子：<br />
<h1><a href="http://www.ibm.com/developerworks/cn/linux/l-hadoop-1/">用 Hadoop 进行分布式数据处理，第 1 部分: 入门</a></h1>
<br />
Here are some notes on the theory:<br />
The storage is provided by HDFS, and analysis by MapReduce.<br />
<br />
MapReduce is a good fit for problems<br />
that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.<br />
An RDBMS is good for point queries or updates, where the dataset has been indexed<br />
to deliver low-latency retrieval and update times of a relatively small amount of<br />
data.<br />
MapReduce suits applications where the data is written once, and read many<br />
times, whereas a relational database is good for datasets that are continually updated.<br />
<br />
MapReduce tries to colocate the data with the compute node, so data access is fast<br />
since it is local. This feature, known as data locality, is at the heart of MapReduce and<br />
is the reason for its good performance.<br />
<br />
Hadoop divides the input to a MapReduce job into fixed-size pieces called input<br />
splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined<br />
map function for each record in the split.<br />
<br />
On the other hand, if splits are too small, then the overhead of managing the splits and<br />
of map task creation begins to dominate the total job execution time. For most jobs, a<br />
good split size tends to be the size of an HDFS block, 64 MB by default.<br />
<br />
Reduce tasks don&#8217;t have the advantage of data locality—the input to a single reduce<br />
task is normally the output from all mappers.<br />
<br />
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays<br />
to minimize the data transferred between map and reduce tasks. Hadoop allows the<br />
user to specify a combiner function to be run on the map output—the combiner function&#8217;s<br />
output forms the input to the reduce function.<br />
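<br />
The effect of a combiner can be sketched in plain Java without any Hadoop classes. This is only an illustration under assumptions: the class CombinerSketch and its map and combine methods are hypothetical names, and the real Hadoop combiner is configured on the job rather than called directly. The sketch counts how many (word, count) pairs would leave one map task with and without local summing.<br />

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    /** Map phase for word count: emit one (word, 1) pair per token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Combiner: sum the counts per word locally, before anything is sent to a reducer. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = map("to be or not to be");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("pairs without combiner: " + mapOutput.size()); // 6
        System.out.println("pairs with combiner:    " + combined.size());  // 4
        System.out.println(combined); // {be=2, not=1, or=1, to=2}
    }
}
```

Summing is safe to apply early because addition is associative and commutative; a function without those properties (a mean, for instance) cannot be used as a combiner unchanged.<br />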
<br />
Why Is a Block in HDFS So Large?<br />
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost<br />
of seeks. By making a block large enough, the time to transfer the data from the disk<br />
can be made to be significantly larger than the time to seek to the start of the block.<br />
Thus the time to transfer a large file made of multiple blocks operates at the disk transfer<br />
rate.<br />
A quick calculation shows that if the seek time is around 10ms, and the transfer rate is<br />
100 MB/s, then to make the seek time 1% of the transfer time, we need to make the<br />
block size around 100 MB. The default is actually 64 MB, although many HDFS installations<br />
use 128 MB blocks. This figure will continue to be revised upward as transfer<br />
speeds grow with new generations of disk drives.<br />
This argument shouldn&#8217;t be taken too far, however. Map tasks in MapReduce normally<br />
operate on one block at a time, so if you have too few tasks (fewer than nodes in the<br />
cluster), your jobs will run slower than they could otherwise.<br />
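In other words: with a large block, the time spent seeking to the block is small and most of the time goes to transferring data. With a small block, the seek time becomes comparable to the transfer time, so the total is roughly twice the transfer time, which is a poor trade. Concretely, take 100 MB of data. With a single 100 MB block, the transfer takes 1 s (at 100 MB/s) plus one 10 ms seek. With 1 MB blocks, the transfer still takes 1 s, but the seeks add up to 10 ms * 100 = 1 s, so the total is 2 s. So is bigger always better? No. If blocks are too large, a file may not be spread across enough nodes, the MapReduce model cannot be used to full effect, and the job may actually run slower.<br />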
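<br />
The arithmetic above can be checked with a few lines of Java. This is a rough sketch under simplifying assumptions: exactly one seek per block, a constant transfer rate, and no parallelism. The class BlockSizeSketch and the method readSeconds are hypothetical names used only for illustration.<br />

```java
public class BlockSizeSketch {
    /** Seconds to read dataMb of data stored in blocks of blockMb, with one seek per block. */
    static double readSeconds(double dataMb, double blockMb, double seekMs, double mbPerSec) {
        long blocks = (long) Math.ceil(dataMb / blockMb);
        double seekSeconds = blocks * seekMs / 1000.0;
        double transferSeconds = dataMb / mbPerSec;
        return seekSeconds + transferSeconds;
    }

    public static void main(String[] args) {
        double seekMs = 10.0;     // seek time from the quoted passage
        double mbPerSec = 100.0;  // transfer rate from the quoted passage
        double dataMb = 100.0;    // data size used in the note above
        for (double blockMb : new double[] {1, 64, 100}) {
            System.out.printf("block %.0f MB -> %.2f s%n",
                    blockMb, readSeconds(dataMb, blockMb, seekMs, mbPerSec));
        }
    }
}
```

With 1 MB blocks the seeks cost as much as the transfer (2 s in total), while with 64 MB or 100 MB blocks the seek overhead is only 10 to 20 ms.<br />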
<br />
<br />
<div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2010-03-12 20:59</div>]]></description></item><item><title>VInt (Variable-Length Integer) in Lucene's Data Storage Format</title><link>http://www.blogjava.net/persister/archive/2010/02/02/311642.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Tue, 02 Feb 2010 03:08:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2010/02/02/311642.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/311642.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2010/02/02/311642.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/311642.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/311642.html</trackback:ping><description><![CDATA[<blockquote>
<p>
A variable-length format for positive integers is
defined where the high-order bit of each byte indicates whether more
bytes remain to be read.  The low-order seven bits are appended as
increasingly more significant bits in the resulting integer value.
Thus values from zero to 127 may be stored in a single byte, values
from 128 to 16,383 may be stored in two bytes, and so on. <br />
</p>
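<p>Definition of the variable-length integer format: the high-order bit of each byte indicates whether more bytes remain to be read, and the low-order seven bits are the payload bits that are appended to the result.</p>
<p>For example, in 00000001 the high-order bit is 0, so the number occupies a single byte; its payload is the low seven bits 0000001, so the value is 1. In 10000010 00000001 the high-order bit of the first byte is 1, meaning another byte follows, and the high-order bit of the second byte is 0, meaning the number ends there, so it occupies two bytes. Note how the value is assembled: the seven payload bits of the last byte become the most significant bits, and so on down to the seven payload bits of the first byte, which are the least significant. The number is therefore 0000001 0000010, which is 130 as an integer.<br />
</p>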
<p><strong>VInt Encoding Example</strong></p>
<table>
    <tbody>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT"><strong>Value</strong>
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT"><strong>First byte</strong>
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT"><strong>Second byte</strong>
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT"><strong>Third byte</strong>
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">0
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            00000000
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">1
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            00000001
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">2
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            00000010
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">...
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">127
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            01111111
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">128
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            10000000
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT">
            00000001
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">129
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            10000001
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT">
            00000001
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">130
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            10000010
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT">
            00000001
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">...
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">16,383
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            11111111
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT">
            01111111
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT"><br />
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">16,384
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            10000000
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT">
            10000000
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT">
            00000001
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">16,385
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            10000001
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT">
            10000000
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT">
            00000001
            </p>
            </font>
            </td>
        </tr>
        <tr>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p align="RIGHT">...
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: 0.11cm; margin-right: 0.01cm;" align="RIGHT">
            <br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.07cm; margin-right: 0.01cm;" align="RIGHT">
            <br />
            </p>
            </font>
            </td>
            <td colspan="" rowspan="" valign="top" align="left" bgcolor="#a0ddf0">
            <font color="#000000" face="arial,helvetica,sanserif" size="-1">
            <p style="margin-left: -0.47cm; margin-right: 0.01cm;" align="RIGHT">
            <br />
            </p>
            </font>
            </td>
        </tr>
    </tbody>
</table>
<p><br />
</p>
<p>This is how the Lucene source code writes and reads a VInt. OutputStream is responsible for writing:</p>
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><pre>
  /** Writes an int in a variable-length format.  Writes between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see InputStream#readVInt()
   */
  public final void writeVInt(int i) throws IOException {
    while ((i &amp; ~0x7F) != 0) {
      writeByte((byte)((i &amp; 0x7f) | 0x80));
      i &gt;&gt;&gt;= 7;
    }
    writeByte((byte)i);
  }</pre></div>
<br />
InputStream is responsible for reading:<br />
<div style="border: 1px solid #cccccc; padding: 4px 5px 4px 4px; background-color: #eeeeee; font-size: 13px; width: 98%;"><pre>
  /** Reads an int stored in variable-length format.  Reads between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see OutputStream#writeVInt(int)
   */
  public final int readVInt() throws IOException {
    byte b = readByte();
    int i = b &amp; 0x7F;
    for (int shift = 7; (b &amp; 0x80) != 0; shift += 7) {
      b = readByte();
      i |= (b &amp; 0x7F) &lt;&lt; shift;
    }
    return i;
  }</pre></div>
<br />
&gt;&gt;&gt; denotes an unsigned (logical) right shift.
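The variable-length integer format read by readVInt can be sketched end to end as below. This is a minimal sketch that works on a plain byte array instead of Lucene's stream classes; the class name VIntSketch and the array-based signatures are assumptions made for illustration. It also shows why the unsigned shift matters: with a signed shift, a negative value would never reach zero and the loop would not end.

```java
import java.io.ByteArrayOutputStream;

class VIntSketch {
    // Encode: emit 7 bits per byte, low-order group first.
    // The high bit of each byte says "more bytes follow".
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7; // unsigned shift: zero-fills, so negative values also terminate
        }
        out.write(i);
        return out.toByteArray();
    }

    // Decode: same loop as the readVInt shown above, reading from an array.
    static int readVInt(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] encoded = writeVInt(300);
        System.out.println(encoded.length + " bytes, decoded: " + readVInt(encoded));
    }
}
```

Small values take one byte, and a full 32-bit value takes at most five.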
<p>
</p>
</blockquote>
<img src ="http://www.blogjava.net/persister/aggbug/311642.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2010-02-02 11:08 <a href="http://www.blogjava.net/persister/archive/2010/02/02/311642.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>First Attempt at Nutch</title><link>http://www.blogjava.net/persister/archive/2009/07/23/288039.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Thu, 23 Jul 2009 07:43:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2009/07/23/288039.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/288039.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2009/07/23/288039.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/288039.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/288039.html</trackback:ping><description><![CDATA[<p>Environment: Nutch 0.9 + Fedora 5 + Tomcat 6 + JDK 6</p>
<p>Tomcat and the JDK are already installed.</p>
<p>2: nutch-0.9.tar.gz<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Extract the downloaded tar.gz package under /opt and rename the directory:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #tar -xzf nutch-0.9.tar.gz -C /opt<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #mv /opt/nutch-0.9 /opt/nutch<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; To check the installation, run /opt/nutch/bin/nutch. If it prints its usage with the list of commands, everything is in place.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Crawling: #cd /opt/nutch<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #mkdir urls<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #vi urls/nutch.txt and enter the seed URL http://www.aicent.net/<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #vi conf/crawl-urlfilter.txt and add the lines below, which use a regular expression to filter the URLs to crawl.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # accept hosts in MY.DOMAIN.NAME<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; +^http://([a-z0-9]*\.)*aicent.net/<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #vi conf/nutch-site.xml (give your spider a name) and set:<br />
&nbsp;&nbsp; &lt;configuration&gt;<br />
&lt;property&gt;<br />
&nbsp;&nbsp;&nbsp; &lt;name&gt;http.agent.name&lt;/name&gt;<br />
&nbsp;&nbsp;&nbsp; &lt;value&gt;test/unique&lt;/value&gt;<br />
&lt;/property&gt;<br />
&lt;/configuration&gt;</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Start the crawl: #bin/nutch crawl urls -dir crawl -depth 5 -threads 10 &gt;&amp; crawl.log</p>
<p>Wait a while; how long depends on the size of the site and the configured crawl depth.</p>
<p><br />
3: apache-tomcat</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; If every search returns 0 results, a setting needs to be changed: the index path used by the Nutch web application in Tomcat is wrong.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml<br />
&lt;property&gt;<br />
&lt;name&gt;searcher.dir&lt;/name&gt;<br />
&lt;value&gt;/opt/nutch/crawl&lt;/value&gt; (the directory that holds the crawled pages)<br />
&lt;description&gt;My path to nutch's searcher dir.&lt;/description&gt;<br />
&lt;/property&gt;</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #/opt/tomcat/bin/startup.sh</p>
<p><br />
OK, done.</p>
<p><br />
Problems encountered:</p>
<p><br />
Run: sh ./bin/nutch crawl urls -dir crawl -depth 3 -threads 60 -topN 100 &gt;&amp;./logs/nutch_log.log</p>
<p>1.Exception in thread "main" java.io.IOException: Job failed!<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.nutch.crawl.Injector.inject(Injector.java:162)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)<br />
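Some posts online blame the JDK version and say that JDK 1.6 cannot be used, so I installed 1.5. The same error still appeared, which was odd.<br />
So I kept searching and found another possible cause:</p>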
<p>Injector: Converting injected urls to crawl db entries. <br />
Exception in thread "main" java.io.IOException: Job failed! <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.nutch.crawl.Injector.inject(Injector.java:162) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)</p>
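<p>Explanation: this is usually a configuration problem in crawl-urlfilter.txt. For example, the filter rule should be <br />
+^http://www.ihooyo.com but was written as http://www.ihooyo.com, without the leading +, which triggers the error above.</p>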
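The role of the leading + in a filter rule can be sketched in plain Java. This is a simplified model, not Nutch's own code: it assumes that each rule is a sign followed by a regular expression, that the first rule whose pattern is found in the URL decides, and that a URL matching no rule is dropped. The class name UrlFilterSketch and its method are hypothetical.

```java
import java.util.regex.Pattern;

class UrlFilterSketch {
    // Each rule is a sign ('+' accept, '-' reject) followed by a regular expression.
    // Returns true if the first rule whose pattern is found in the URL is a '+' rule.
    static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            char sign = rule.charAt(0);
            if (sign != '+' && sign != '-') {
                // A rule without a sign is a configuration error, as in the case described above.
                throw new IllegalArgumentException("rule must start with + or -: " + rule);
            }
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return sign == '+';
            }
        }
        return false; // no rule matched: the URL is ignored
    }

    public static void main(String[] args) {
        String[] rules = { "+^http://([a-z0-9]*\\.)*aicent.net/" };
        System.out.println(accepts(rules, "http://www.aicent.net/index.html"));
        System.out.println(accepts(rules, "http://www.example.com/"));
    }
}
```

Under this model a seed URL written without the http:// scheme is also rejected, because the pattern is anchored at the start of the URL.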
<p>But my own configuration had no such problem.<br />
Besides nutch_log.log, the logs directory also contains an automatically generated log file, hadoop.log,<br />
and it showed an error:</p>
<p><br />
2009-07-22 22:20:55,501 INFO&nbsp; crawl.Crawl - crawl started in: crawl<br />
2009-07-22 22:20:55,501 INFO&nbsp; crawl.Crawl - rootUrlDir = urls<br />
2009-07-22 22:20:55,502 INFO&nbsp; crawl.Crawl - threads = 60<br />
2009-07-22 22:20:55,502 INFO&nbsp; crawl.Crawl - depth = 3<br />
2009-07-22 22:20:55,502 INFO&nbsp; crawl.Crawl - topN = 100<br />
2009-07-22 22:20:55,603 INFO&nbsp; crawl.Injector - Injector: starting<br />
2009-07-22 22:20:55,604 INFO&nbsp; crawl.Injector - Injector: crawlDb: crawl/crawldb<br />
2009-07-22 22:20:55,604 INFO&nbsp; crawl.Injector - Injector: urlDir: urls<br />
2009-07-22 22:20:55,605 INFO&nbsp; crawl.Injector - Injector: Converting injected urls to crawl db entries.<br />
2009-07-22 22:20:56,574 INFO&nbsp; plugin.PluginRepository - Plugins: looking in: /opt/nutch/plugins<br />
2009-07-22 22:20:56,720 INFO&nbsp; plugin.PluginRepository - Plugin Auto-activation mode: [true]<br />
2009-07-22 22:20:56,720 INFO&nbsp; plugin.PluginRepository - Registered Plugins:<br />
2009-07-22 22:20:56,720 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the nutch core extension points (nutch-extensionpoints)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Basic Query Filter (query-basic)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Basic URL Normalizer (urlnormalizer-basic)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Basic Indexing Filter (index-basic)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Html Parse Plug-in (parse-html)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Basic Summarizer Plug-in (summary-basic)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Site Query Filter (query-site)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; HTTP Framework (lib-http)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Text Parse Plug-in (parse-text)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Regex URL Filter (urlfilter-regex)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Pass-through URL Normalizer (urlnormalizer-pass)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Http Protocol Plug-in (protocol-http)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Regex URL Normalizer (urlnormalizer-regex)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; OPIC Scoring Plug-in (scoring-opic)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; CyberNeko HTML Parser (lib-nekohtml)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; JavaScript Parser (parse-js)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; URL Query Filter (query-url)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Regex URL Filter Framework (lib-regex-filter)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository - Registered Extension-Points:<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch Summarizer (org.apache.nutch.searcher.Summarizer)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch Protocol (org.apache.nutch.protocol.Protocol)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)<br />
2009-07-22 22:20:56,721 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch URL Filter (org.apache.nutch.net.URLFilter)<br />
2009-07-22 22:20:56,722 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)<br />
2009-07-22 22:20:56,722 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)<br />
2009-07-22 22:20:56,722 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)<br />
2009-07-22 22:20:56,722 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch Content Parser (org.apache.nutch.parse.Parser)<br />
2009-07-22 22:20:56,722 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)<br />
2009-07-22 22:20:56,722 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)<br />
2009-07-22 22:20:56,722 INFO&nbsp; plugin.PluginRepository -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ontology Model Loader (org.apache.nutch.ontology.Ontology)<br />
2009-07-22 22:20:56,786 WARN&nbsp; regex.RegexURLNormalizer - can't find rules for scope 'inject', using default<br />
2009-07-22 22:20:56,829 WARN&nbsp; mapred.LocalJobRunner - job_2319eh<br />
java.lang.RuntimeException: java.net.UnknownHostException: jackliu: jackliu<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.io.SequenceFile$Writer.&lt;init&gt;(SequenceFile.java:617)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.io.SequenceFile$Writer.&lt;init&gt;(SequenceFile.java:591)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:364)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:390)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.startPartition(MapTask.java:294)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:355)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$100(MapTask.java:231)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.mapred.MapTask.run(MapTask.java:180)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)<br />
Caused by: java.net.UnknownHostException: jackliu: jackliu<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at java.net.InetAddress.getLocalHost(InetAddress.java:1353)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at org.apache.hadoop.io.SequenceFile$Writer.&lt;init&gt;(SequenceFile.java:614)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ... 8 more</p>
<p>In other words, the local hostname could not be resolved, so:<br />
Add the following to your /etc/hosts file<br />
127.0.0.1&nbsp;&nbsp;&nbsp; jackliu</p>
<p>Running the crawl again succeeded.</p>
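The stack trace above fails inside InetAddress.getLocalHost(), so the same call can serve as a quick check before starting a crawl. The sketch below is only an illustration; the class name HostCheck and the wording of the report are assumptions.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class HostCheck {
    // Returns a one-line report instead of throwing, so it can be used as a pre-flight check.
    static String describeLocalHost() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            return "OK: " + local.getHostName() + " resolves to " + local.getHostAddress();
        } catch (UnknownHostException e) {
            return "FAIL: " + e.getMessage() + " (add the hostname to /etc/hosts)";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeLocalHost());
    }
}
```

If this prints a FAIL line, the crawl is likely to hit the same UnknownHostException.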
<p>&nbsp;</p>
<p>2:http://127.0.0.1:8080/nutch-0.9<br />
&nbsp;Searching for "nutch" produced an error:<br />
&nbsp;HTTP Status 500 -</p>
<p>type Exception report</p>
<p>message</p>
<p>description The server encountered an internal error () that prevented it from fulfilling this request.</p>
<p>exception</p>
<p>org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value&nbsp; language + "/include/header.html" is quoted with " which must be escaped when used within the value<br />
&nbsp;org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)<br />
&nbsp;org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)<br />
&nbsp;org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)<br />
&nbsp;org.apache.jasper.compiler.Parser.parseQuoted(Parser.java:299)<br />
&nbsp;org.apache.jasper.compiler.Parser.parseAttributeValue(Parser.java:249)<br />
&nbsp;org.apache.jasper.compiler.Parser.parseAttribute(Parser.java:211)<br />
&nbsp;org.apache.jasper.compiler.Parser.parseAttributes(Parser.java:154)<br />
&nbsp;org.apache.jasper.compiler.Parser.parseInclude(Parser.java:867)<br />
&nbsp;org.apache.jasper.compiler.Parser.parseStandardAction(Parser.java:1134)<br />
&nbsp;org.apache.jasper.compiler.Parser.parseElements(Parser.java:1461)<br />
&nbsp;org.apache.jasper.compiler.Parser.parse(Parser.java:137)<br />
&nbsp;org.apache.jasper.compiler.ParserController.doParse(ParserController.java:255)<br />
&nbsp;org.apache.jasper.compiler.ParserController.parse(ParserController.java:103)<br />
&nbsp;org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:170)<br />
&nbsp;org.apache.jasper.compiler.Compiler.compile(Compiler.java:332)<br />
&nbsp;org.apache.jasper.compiler.Compiler.compile(Compiler.java:312)<br />
&nbsp;org.apache.jasper.compiler.Compiler.compile(Compiler.java:299)<br />
&nbsp;org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:586)<br />
&nbsp;org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:317)<br />
&nbsp;org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)<br />
&nbsp;org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)<br />
&nbsp;javax.servlet.http.HttpServlet.service(HttpServlet.java:717)</p>
<p>note The full stack trace of the root cause is available in the Apache Tomcat/6.0.20 logs.</p>
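<p>Analysis: search.jsp in the root directory of the Nutch web application shows that this is a quote-matching problem.</p>
<p>&lt;jsp:include page="&lt;%= language + "/include/header.html"%&gt;"/&gt;&nbsp; //line 152 search.jsp</p>
<p>The opening quote is paired with the next quote that appears, not with the last quote on the line, hence the error.</p>
<p>Fix:</p>
<p>Change the line to: &lt;jsp:include page="&lt;%= language+urlsuffix %&gt;"/&gt;</p>
<p>Here urlsuffix is a new string, defined right after the definition of language:</p>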
<p>&nbsp; String language =&nbsp;&nbsp; // line 116 search.jsp<br />
&nbsp;&nbsp;&nbsp; ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale())<br />
&nbsp;&nbsp;&nbsp; .getLocale().getLanguage();<br />
&nbsp;String urlsuffix="/include/header.html";</p>
<p>After the change, restart Tomcat to make sure it takes effect. Searching then no longer reports the error.</p>
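<p><br />
3. Searches return no results?<br />
&nbsp; The output in nutch_log.log differed from what is described online, and the crawl directory contained only two folders, segments and crawldb. After crawling again<br />
&nbsp; everything was fine; it is unclear why the previous crawl had failed.<br />
&nbsp; <br />
4. cached.jsp, explain.jsp and others have the same quoting error as in item 2; correcting them the same way fixes it.</p>
<p>5. It took a whole morning and half an afternoon today to finally get Nutch installed and configured. More study tomorrow.</p>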
<img src ="http://www.blogjava.net/persister/aggbug/288039.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2009-07-23 15:43 <a href="http://www.blogjava.net/persister/archive/2009/07/23/288039.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>PhraseQuery, SpanQuery and PhrasePrefixQuery</title><link>http://www.blogjava.net/persister/archive/2009/07/14/286634.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Tue, 14 Jul 2009 01:49:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2009/07/14/286634.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/286634.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2009/07/14/286634.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/286634.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/286634.html</trackback:ping><description><![CDATA[<p>PhraseQuery uses position information to match terms that occur near one another. If TermQuery clauses for &#8220;we&#8221; and &#8220;motherland&#8221; are used, every document that contains both words is returned, however far apart the words are. Sometimes, though, we only want documents in which the two words are separated by at most one or two terms rather than being arbitrarily far apart. That is what PhraseQuery is for. Take the following case:<br />
&nbsp;&nbsp;&nbsp;&nbsp;doc.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));<br />
Then:<br />
&nbsp;&nbsp;&nbsp;&nbsp;String[] phrase = new String[] {"quick", "fox"};<br />
&nbsp;&nbsp;&nbsp;&nbsp;assertFalse("exact phrase not found", matched(phrase, 0));<br />
&nbsp;&nbsp;&nbsp;&nbsp;assertTrue("close enough", matched(phrase, 1));<br />
multi-terms:<br />
&nbsp;&nbsp;&nbsp;&nbsp;assertFalse("not close enough", matched(new String[] {"quick", "jumped", "lazy"}, 3));<br />
&nbsp;&nbsp;&nbsp;&nbsp;assertTrue("just enough", matched(new String[] {"quick", "jumped", "lazy"}, 4));<br />
&nbsp;&nbsp;&nbsp;&nbsp;assertFalse("almost but not quite", matched(new String[] {"lazy", "jumped", "quick"}, 7));<br />
&nbsp;&nbsp;&nbsp;&nbsp;assertTrue("bingo", matched(new String[] {"lazy", "jumped", "quick"}, 8));<br />
<br />
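The number is the slop, set as shown below. It is the maximum number of position moves allowed to arrange the terms into the order given in the query.<br />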
&nbsp;&nbsp;&nbsp;&nbsp;query.setSlop(slop);</p>
<p>Order matters:<br />
&nbsp;&nbsp;&nbsp;&nbsp;String[] phrase = new String[] {"fox", "quick"};<br />
assertFalse("hop flop", matched(phrase, 2));<br />
assertTrue("hop hop slop", matched(phrase, 3));<br />
<br />
The figure below shows how this works:<br />
<br />
<img alt="" src="http://www.blogjava.net/images/blogjava_net/persister/untitled.JPG" width="468" border="0" height="240" /><br />
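For the query terms quick and fox, fox only needs to move one position to match quick brown fox. For the terms fox and quick, fox has to move three positions. The larger the total move distance, the lower the document's score and the lower it ranks in the results.<br />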
<br />
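The slop values in the assertions above can be reproduced with a small model in plain Java. This is a simplified sketch, not Lucene's scorer: it assumes whitespace tokenization and looks only at the first occurrence of each term. For every query term it computes the offset between the term's position in the document and its position in the phrase; the required slop is the spread between the largest and the smallest offset. The class name PhraseSlopSketch and its methods are hypothetical.

```java
class PhraseSlopSketch {
    // Position of a term in the tokenized document, or -1 if it is absent.
    static int positionOf(String[] docTerms, String term) {
        for (int i = 0; i < docTerms.length; i++) {
            if (docTerms[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }

    // Smallest slop at which the phrase would match, or -1 if a term is missing.
    // Simplification: only the first occurrence of each term is considered.
    static int requiredSlop(String doc, String[] phrase) {
        String[] docTerms = doc.split(" ");
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int q = 0; q < phrase.length; q++) {
            int d = positionOf(docTerms, phrase[q]);
            if (d < 0) {
                return -1;
            }
            int offset = d - q; // how far the term sits from where the phrase wants it
            min = Math.min(min, offset);
            max = Math.max(max, offset);
        }
        return max - min;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox jumped over the lazy dog";
        System.out.println(requiredSlop(doc, new String[] { "quick", "fox" }));
        System.out.println(requiredSlop(doc, new String[] { "fox", "quick" }));
    }
}
```

With this model, quick fox needs a slop of 1 and fox quick needs 3, which agrees with the assertions.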
SpanQuery uses position information to support more interesting queries:<br />
<br />
SpanQuery type&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Description<br />
SpanTermQuery&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Used in conjunction with the other span query types. On its own, it&#8217;s<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;functionally equivalent to TermQuery.<br />
SpanFirstQuery&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Matches spans that occur within the first part of a field.<br />
SpanNearQuery&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Matches spans that occur near one another.<br />
SpanNotQuery&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Matches spans that don&#8217;t overlap one another.<br />
SpanOrQuery&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Aggregates matches of span queries.<br />
<br />
SpanFirstQuery: To query for spans that occur within the first n positions of a field, use SpanFirstQuery.<br />
<br />
<img alt="" src="http://www.blogjava.net/images/blogjava_net/persister/span.jpg" border="0" /><br />
<br />
quick = new SpanTermQuery(new Term("f", "quick"));<br />
brown = new SpanTermQuery(new Term("f", "brown"));<br />
red = new SpanTermQuery(new Term("f", "red"));<br />
fox = new SpanTermQuery(new Term("f", "fox"));<br />
lazy = new SpanTermQuery(new Term("f", "lazy"));<br />
sleepy = new SpanTermQuery(new Term("f", "sleepy"));<br />
dog = new SpanTermQuery(new Term("f", "dog"));<br />
cat = new SpanTermQuery(new Term("f", "cat"));<br />
<br />
SpanFirstQuery sfq = new SpanFirstQuery(brown, 2);<br />
assertNoMatches(sfq);<br />
sfq = new SpanFirstQuery(brown, 3);<br />
assertOnlyBrownFox(sfq);<br />
<br />
SpanNearQuery:<br />
<br />
Spans that are near one another </p>
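<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; First, a note on PhraseQuery: it is not one of the span query classes, but it can do a similar job.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The terms in a matching document are normally adjacent. To allow for intervening terms in the document, or to match terms that appear in reversed order, PhraseQuery has a slop factor. <font color="#ff0000">This slop is the maximum distance allowed between the terms, but not distance in the usual sense: it is the number of position moves needed to arrange the terms into the given phrase, in order.</font> <font color="#0000ff">This means PhraseQuery always measures against the order in which the query terms are given.</font> For example, if the document is quick brown fox, the query quick fox needs a slop of 1 (one move), while fox quick needs three moves, so its slop is 3.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Next, consider SpanTermQuery, a subclass of SpanQuery.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Combined in a SpanNearQuery, it supports span matching <font color="#0000ff">that does not have to follow the order in which the terms appear in the document</font>: a separate flag says whether the clauses must match in order or may also match in reversed order. <font color="#ff0000">The slop here is not a count of moves either. It is measured from the start of the first span to the end of the last span.</font></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Using SpanTermQuery objects as the SpanQuery clauses of a SpanNearQuery gives results very similar to PhraseQuery. The third constructor argument of SpanNearQuery is the inOrder flag: when it is true the clauses must match in the order given, and when it is false they may match in either order.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For example, take the document: the quick brown fox jumps over the lazy dog</p>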
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; public void testSpanNearQuery() throws Exception{</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SpanQuery[] quick_brown_dog=new SpanQuery[]{quick,brown,dog};</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SpanNearQuery snq=new SpanNearQuery(quick_brown_dog,0,true);//in order, slop 0, three terms</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; assertNoMatches(snq);//no match</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; snq=new SpanNearQuery(quick_brown_dog,4,true);//in order, slop 4, three terms</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; assertNoMatches(snq);//no match</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; snq=new SpanNearQuery(quick_brown_dog,5,true);//in order, slop 5, three terms</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; assertOnlyBrownFox(snq);//matches</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="#cc0099">&nbsp;&nbsp; snq=new SpanNearQuery(new SpanQuery[]{lazy,fox},3,false);//</font>any order, slop 3, two terms</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; assertOnlyBrownFox(snq);//matches</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //Now the same with PhraseQuery: it measures against the query order, so lazy and fox need a slop of 5</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; PhraseQuery pq=new PhraseQuery();</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pq.add(new Term("f","lazy"));</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pq.add(new Term("f","fox"));</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pq.setSlop(4);</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; assertNoMatches(pq);//slop 4 does not match</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //PhraseQuery with a slop of 5</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pq.setSlop(5);</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; assertOnlyBrownFox(pq);</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
<br />
<br />
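The SpanNearQuery slop values in the test above can be checked with a similar plain-Java model. This is a simplified sketch, not Lucene's implementation: it assumes whitespace tokenization, single-term clauses that each cover one position, and only the first occurrence of each term. The slop is the width of the enclosing span, from the start of the first clause to the end of the last, minus the number of clauses. The class name SpanSlopSketch and its methods are hypothetical.

```java
class SpanSlopSketch {
    // Position of a term in the tokenized document, or -1 if it is absent.
    static int positionOf(String[] docTerms, String term) {
        for (int i = 0; i < docTerms.length; i++) {
            if (docTerms[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }

    // Slop needed for single-term clauses: enclosing span width minus the number of clauses.
    // Returns -1 when a term is missing, or when inOrder is true and the terms
    // do not appear in the document in the order given.
    static int requiredSlop(String doc, String[] terms, boolean inOrder) {
        String[] docTerms = doc.split(" ");
        int start = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        int previous = -1;
        for (String term : terms) {
            int d = positionOf(docTerms, term);
            if (d < 0) {
                return -1;
            }
            if (inOrder && d <= previous) {
                return -1;
            }
            previous = d;
            start = Math.min(start, d);
            end = Math.max(end, d + 1); // a one-term span ends just past its position
        }
        return (end - start) - terms.length;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox jumps over the lazy dog";
        System.out.println(requiredSlop(doc, new String[] { "quick", "brown", "dog" }, true));
        System.out.println(requiredSlop(doc, new String[] { "lazy", "fox" }, false));
    }
}
```

Under these assumptions quick, brown, dog in order needs a slop of 5, and lazy, fox in any order needs 3, in line with the test.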
3. PhrasePrefixQuery is mainly used for synonym-style queries:<br />
&nbsp;&nbsp;&nbsp;&nbsp;IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);<br />
&nbsp;&nbsp;&nbsp;&nbsp;Document doc1 = new Document();<br />
&nbsp;&nbsp;&nbsp;&nbsp;doc1.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));<br />
&nbsp;&nbsp;&nbsp;&nbsp;writer.addDocument(doc1);<br />
&nbsp;&nbsp;&nbsp;&nbsp;Document doc2 = new Document();<br />
&nbsp;&nbsp;&nbsp;&nbsp;doc2.add(Field.Text("field","the fast fox hopped over the hound"));<br />
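&nbsp;&nbsp;&nbsp;&nbsp;writer.addDocument(doc2);<br />
&nbsp;&nbsp;&nbsp;&nbsp;writer.close();<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;IndexSearcher searcher = new IndexSearcher(directory);<br />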
&nbsp;&nbsp;&nbsp;&nbsp;PhrasePrefixQuery query = new PhrasePrefixQuery();<br />
&nbsp;&nbsp;&nbsp;&nbsp;query.add(new Term[] {new Term("field", "quick"), new Term("field", "fast")});<br />
&nbsp;&nbsp;&nbsp;&nbsp;query.add(new Term("field", "fox"));<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;Hits hits = searcher.search(query);<br />
&nbsp;&nbsp;&nbsp;&nbsp;assertEquals("fast fox match", 1, hits.length());<br />
&nbsp;&nbsp;&nbsp;&nbsp;query.setSlop(1);<br />
&nbsp;&nbsp;&nbsp;&nbsp;hits = searcher.search(query);<br />
&nbsp;&nbsp;&nbsp;&nbsp;assertEquals("both match", 2, hits.length());</p>
<img src ="http://www.blogjava.net/persister/aggbug/286634.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2009-07-14 09:49 <a href="http://www.blogjava.net/persister/archive/2009/07/14/286634.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Some Considerations for Query Keywords in Search Engines</title><link>http://www.blogjava.net/persister/archive/2009/07/11/286377.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Sat, 11 Jul 2009 09:33:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2009/07/11/286377.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/286377.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2009/07/11/286377.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/286377.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/286377.html</trackback:ping><description><![CDATA[1. First, misspellings. How can we tell that an input term is misspelled, or still return the right results when it is? One approach with Lucene is the Metaphone algorithm, which matches words by how they sound.<br />
<br />
2. Synonym queries. Both SynonymAnalyzer and PhrasePrefixQuery can handle this.
<img src ="http://www.blogjava.net/persister/aggbug/286377.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2009-07-11 17:33 <a href="http://www.blogjava.net/persister/archive/2009/07/11/286377.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Analyzer</title><link>http://www.blogjava.net/persister/archive/2009/07/07/285833.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Tue, 07 Jul 2009 07:59:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2009/07/07/285833.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/285833.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2009/07/07/285833.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/285833.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/285833.html</trackback:ping><description><![CDATA[Primary analyzers available in Lucene &nbsp; <br />
<table border="1">
<tr><th>Analyzer</th><th>Steps taken</th></tr>
<tr><td>WhitespaceAnalyzer</td><td>Splits tokens at whitespace</td></tr>
<tr><td>SimpleAnalyzer</td><td>Divides text at nonletter characters and lowercases</td></tr>
<tr><td>StopAnalyzer</td><td>Divides text at nonletter characters, lowercases, and removes stop words</td></tr>
<tr><td>StandardAnalyzer</td><td>Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words</td></tr>
</table>
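The first three rows of the table can be approximated in plain Java to see how the output differs. This is only a rough sketch of the steps listed above, not Lucene's analyzers: the class name AnalyzerSketch, its methods and the short stop list are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Locale;

class AnalyzerSketch {
    // Hypothetical stop list for illustration; not the list Lucene ships with.
    static final String[] STOP_WORDS = { "a", "an", "and", "the", "of" };

    // WhitespaceAnalyzer-like: split at whitespace only, keep case and punctuation.
    static String[] whitespace(String text) {
        return text.trim().split("\\s+");
    }

    // SimpleAnalyzer-like: split at non-letter characters, then lowercase.
    static String[] simple(String text) {
        String cleaned = text.replaceAll("[^\\p{L}]+", " ").trim().toLowerCase(Locale.ROOT);
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    // StopAnalyzer-like: same as simple(), then drop the stop words.
    static String[] stop(String text) {
        return Arrays.stream(simple(text))
                .filter(token -> !Arrays.asList(STOP_WORDS).contains(token))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown Fox";
        System.out.println(String.join(" | ", whitespace(text)));
        System.out.println(String.join(" | ", simple(text)));
        System.out.println(String.join(" | ", stop(text)));
    }
}
```

The same analyzer choice has to be applied at index time and at query time, otherwise tokens such as Quick-Brown and quick will not meet.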
<img src ="http://www.blogjava.net/persister/aggbug/285833.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2009-07-07 15:59 <a href="http://www.blogjava.net/persister/archive/2009/07/07/285833.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Porter stemming algorithm</title><link>http://www.blogjava.net/persister/archive/2009/07/06/285728.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Mon, 06 Jul 2009 14:47:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2009/07/06/285728.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/285728.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2009/07/06/285728.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/285728.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/285728.html</trackback:ping><description><![CDATA[PorterStemFilter<br />
<a href="http://en.wikipedia.org/wiki/Stemming" target="_blank"><strong>Stemming</strong></a> means reducing a word to its root form; here is an <strong><a href="http://www.comp.lancs.ac.uk/computing/research/stemming/general/index.htm" target="_blank">overview</a></strong>. In an inflected language such as English, a word takes many forms through suffixes such as -ed, -ing, and -ly. If tokenization can recover the stem of each variant, search results improve considerably. There are many stemming algorithms. The three mainstream ones are the <a href="http://www.tartarus.org/~martin/PorterStemmer/index.html" target="_blank"><strong>Porter stemming algorithm</strong></a>, the <a href="http://www.cs.waikato.ac.nz/~eibe/stemmers/index.html" target="_blank"><strong>Lovins stemming algorithm</strong></a>, and the <a href="http://www.comp.lancs.ac.uk/computing/research/stemming/index.htm" target="_blank"><strong>Lancaster (Paice/Husk) stemming algorithm</strong></a>, and there are also refined variants and other algorithms. The PorterStemmer that PorterStemFilter calls is an implementation of the <a href="http://www.tartarus.org/~martin/PorterStemmer/index.html" target="_blank"><strong>Porter Stemming algorithm</strong></a>.
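<br />
<br />
To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is NOT the Porter algorithm: the class NaiveStemmer and its four-suffix list are assumptions for illustration, and the real algorithm applies several ordered rule groups with conditions on the remaining stem.<br />

```java
import java.util.Locale;

final class NaiveStemmer {
    // Toy suffix list tried in order; an assumption for illustration, not the Porter rule set.
    private static final String[] SUFFIXES = {"ing", "ed", "ly", "s"};

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words such as "is" or "red" survive.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}
```

With this sketch "walked", "walking", and "walks" all reduce to "walk", but "lived" becomes "liv" rather than "live". Cases like that are the reason real stemmers need more careful rules than blind suffix removal.<br />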
<img src ="http://www.blogjava.net/persister/aggbug/285728.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2009-07-06 22:47 <a href="http://www.blogjava.net/persister/archive/2009/07/06/285728.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>How Lucene's inverted index works</title><link>http://www.blogjava.net/persister/archive/2009/06/10/281201.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Wed, 10 Jun 2009 10:08:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2009/06/10/281201.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/281201.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2009/06/10/281201.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/281201.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/281201.html</trackback:ping><description><![CDATA[Reposted from: http://blog.donews.com/windshow/archive/2005/11/24/638234.aspx<br />
<br />
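Inverted index<br />
<br />
Lucene is a high-performance full-text search toolkit for Java, and it uses an inverted-file index structure. The structure and the algorithm that builds it are as follows:<br />
<br />
0) Suppose there are two documents, 1 and 2.<br />
Document 1: Tom&nbsp;lives&nbsp;in&nbsp;Guangzhou,&nbsp;I&nbsp;live&nbsp;in&nbsp;Guangzhou&nbsp;too.<br />
Document 2: He&nbsp;once&nbsp;lived&nbsp;in&nbsp;Shanghai.<br />
<br />
1) Because Lucene indexes and queries by keyword, we first have to extract the keywords of the two documents, which usually involves the following steps:<br />
a. What we have is the document content, a single string. We first find all the words in the string, that is, we tokenize it. English words are separated by spaces and are easy to handle; Chinese words run together and need a dedicated word segmenter.<br />
b. Words such as &#8220;in&#8221;, &#8220;once&#8221;, and &#8220;too&#8221; carry little meaning, as do Chinese characters such as &#8220;的&#8221; and &#8220;是&#8221;. Words that do not stand for a concept can be filtered out.<br />
c. A user searching for &#8220;He&#8221; usually also wants documents containing &#8220;he&#8221; or &#8220;HE&#8221;, so all words are normalized to one case.<br />
d. A user searching for &#8220;live&#8221; usually also wants documents containing &#8220;lives&#8221; or &#8220;lived&#8221;, so &#8220;lives&#8221; and &#8220;lived&#8221; are reduced to &#8220;live&#8221;.<br />
e. Punctuation usually does not stand for a concept and can be filtered out as well.<br />
In Lucene these steps are performed by the Analyzer class.<br />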
<br />
After this processing:<br />
&nbsp;&nbsp;&nbsp;&nbsp;Keywords of document 1: [tom]&nbsp;[live]&nbsp;[guangzhou]&nbsp;[i]&nbsp;[live]&nbsp;[guangzhou]<br />
&nbsp;&nbsp;&nbsp;&nbsp;Keywords of document 2: [he]&nbsp;[live]&nbsp;[shanghai]<br />
<br />
2)&nbsp;With the keywords in hand we can build the inverted index. The mapping above goes from a document ID to all the keywords in that document. An inverted index reverses it, mapping each keyword to the IDs of all documents that contain it. After inversion, documents 1 and 2 become:<br />
Keyword&nbsp;&nbsp;&nbsp;&nbsp;Document IDs<br />
guangzhou&nbsp;&nbsp;1<br />
he&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2<br />
i&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1<br />
live&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1,2<br />
shanghai&nbsp;&nbsp;&nbsp;2<br />
tom&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1<br />
<br />
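Knowing which documents contain a keyword is usually not enough; we also need to know how often the keyword occurs in each document and where. Two kinds of position are common: a) character position, which records the character offset of the word in the document (its advantage is fast positioning when highlighting keywords); b) keyword position, which records that the word is the n-th keyword in the document (its advantages are a smaller index and fast phrase queries). Lucene records the second kind.<br />
<br />
With frequency and position added, the index structure becomes:<br />
Keyword&nbsp;&nbsp;&nbsp;Document ID[frequency]&nbsp;&nbsp;&nbsp;Positions<br />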
guangzhou&nbsp;1[2]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3,6<br />
he&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2[1]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1<br />
i&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1[1]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4<br />
live&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1[2],2[1]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2,5,2<br />
shanghai&nbsp;&nbsp;2[1]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3<br />
tom&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1[1]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1<br />
<br />
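Take the live row as an example. live occurs twice in document 1 and once in document 2, and its position list is &#8220;2,5,2&#8221;. What does that mean? It has to be read together with the document IDs and frequencies: live occurs twice in document 1, so &#8220;2,5&#8221; are its two positions in document 1; it occurs once in document 2, so the remaining &#8220;2&#8221; means that live is the 2nd keyword of document 2.<br />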
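<br />
The table with frequencies and positions can be reproduced with a small in-memory structure. This is a sketch under assumptions, not Lucene's implementation: the class InvertedIndexSketch and its methods are hypothetical, and the keywords are assumed to have already passed through an analyzer.<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    // term -> (document id -> 1-based keyword positions); TreeMap keeps terms sorted.
    private final TreeMap<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    // Records every keyword of one document together with its keyword position.
    void add(int docId, List<String> keywords) {
        for (int i = 0; i < keywords.size(); i++) {
            postings.computeIfAbsent(keywords.get(i), k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(i + 1);
        }
    }

    // Renders one row as "term doc[freq],... positions", mirroring the table layout.
    String row(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null) return term + " (absent)";
        StringBuilder freqs = new StringBuilder();
        StringBuilder positions = new StringBuilder();
        for (Map.Entry<Integer, List<Integer>> entry : docs.entrySet()) {
            if (freqs.length() > 0) freqs.append(',');
            freqs.append(entry.getKey()).append('[').append(entry.getValue().size()).append(']');
            for (int position : entry.getValue()) {
                if (positions.length() > 0) positions.append(',');
                positions.append(position);
            }
        }
        return term + " " + freqs + " " + positions;
    }
}
```

Adding the keyword lists of documents 1 and 2 and then calling row("live") yields the document IDs with frequencies followed by the flat position list, which is the same shape as the live row discussed above.<br />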
&nbsp;&nbsp;&nbsp;&nbsp;<br />
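That is the core of the Lucene index structure. Note that the keywords are stored in character order (Lucene does not use a B-tree), so Lucene can locate a keyword quickly with binary search.<br />
&nbsp;&nbsp;&nbsp;&nbsp;<br />
In the implementation, Lucene stores the three columns above as the term dictionary file (Term&nbsp;Dictionary), the frequency file (frequencies), and the position file (positions). The dictionary file holds not only every keyword but also pointers into the frequency file and the position file, through which the keyword's frequency and position information can be found.<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;Lucene uses the concept of a field to express where a piece of information is located (in the title, in the body, in the URL). During indexing the field is also recorded in the dictionary file, and every keyword carries field information, because every keyword belongs to one or more fields.<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;To reduce the size of the index files, Lucene also compresses the index. First, keywords in the dictionary file are compressed to &lt;prefix length, suffix&gt;. For example, if the current word is &#8220;阿拉伯语&#8221; (Arabic) and the previous word is &#8220;阿拉伯&#8221; (Arab), then &#8220;阿拉伯语&#8221; is compressed to &lt;3, 语&gt;. Second, numbers are compressed heavily: only the difference from the previous value is stored, which shortens the number and so reduces the bytes needed to store it. For example, if the current document ID is 16389 (3 bytes uncompressed) and the previous document ID is 16382, the stored value is 7 (a single byte).<br />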
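<br />
Both compression ideas fit in a few lines. The sketch below is illustrative only: the class IndexCompressionSketch and its method names are assumptions, and the byte counts assume a variable-length integer encoding with 7 payload bits per byte, which is the assumption under which 16389 needs 3 bytes and 7 needs 1.<br />

```java
final class IndexCompressionSketch {
    // Number of leading characters shared by the previous and the current term.
    static int sharedPrefixLength(String previous, String current) {
        int max = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < max && previous.charAt(i) == current.charAt(i)) i++;
        return i;
    }

    // Encodes a term as "<prefix length,suffix>" relative to the previous term.
    static String prefixEncode(String previous, String current) {
        int shared = sharedPrefixLength(previous, current);
        return "<" + shared + "," + current.substring(shared) + ">";
    }

    // Replaces each sorted document id by its gap from the previous id.
    static int[] deltaEncode(int[] sortedDocIds) {
        int[] gaps = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return gaps;
    }

    // Bytes needed when every byte carries 7 payload bits plus one continuation bit.
    static int variableByteLength(int value) {
        int bytes = 1;
        while ((value >>>= 7) != 0) bytes++;
        return bytes;
    }
}
```

Under these assumptions variableByteLength(16389) is 3 and variableByteLength(7) is 1, matching the byte counts quoted in the text.<br />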
&nbsp;&nbsp;&nbsp;&nbsp;<br />
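&nbsp;&nbsp;&nbsp;&nbsp;Querying this index shows why it is worth building.<br />
Suppose we search for the word&nbsp;&#8220;live&#8221;. Lucene binary-searches the dictionary, finds the term, follows the pointer into the frequency file to read all the document IDs, and returns the result. The dictionary is usually very small, so the whole process takes milliseconds.<br />
With an ordinary sequential matching algorithm and no index, the content of every document has to be string-matched. That is quite slow, and with a large number of documents the time is often unacceptable.<br />
<br />
My own notes:<br />
See also http://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95<br />
<br />
<br />
Binary search<br />
Binary search finds a given element in a sorted array.<br />
First compare the target with the middle element of the array. If they are equal, return a pointer to that element, which means it was found. If not, the function continues in the half that could still contain the value, and so on. If the remaining array has length 0, the value is not present and the function ends.<br />
The function is as follows:<br />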
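<pre>#include &lt;stddef.h&gt;  /* NULL */<br />
<br />
/* Returns a pointer to val in the sorted array of n ints, or NULL if absent. */<br />
int *binarySearch(int val, int array[], int n)<br />
{<br />
int m = n/2;  /* index of the middle element */<br />
if(n &lt;= 0) return NULL;  /* empty range: not found */<br />
if(val == array[m]) return array + m;<br />
if(val &lt; array[m]) return binarySearch(val, array, m);  /* search the left half */<br />
else return binarySearch(val, array+m+1, n-m-1);  /* search the right half */<br />
}</pre>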
<p><br />
For an array of n elements, binary search probes at most 1+log2(n) elements. With one million elements that is about 20 probes, that is, at most 20 recursive calls to binarySearch().</p>
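<p>The same idea applied to a sorted term dictionary can be sketched in Java. The class TermLookupSketch is hypothetical and works on a plain String array; how Lucene actually lays out and searches its dictionary on disk is not shown here.</p>

```java
final class TermLookupSketch {
    // Returns the index of term in sortedTerms, or -1 when the term is absent.
    static int find(String[] sortedTerms, String term) {
        int low = 0;
        int high = sortedTerms.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            int cmp = term.compareTo(sortedTerms[mid]);
            if (cmp == 0) return mid;
            if (cmp < 0) {
                high = mid - 1; // continue in the left half
            } else {
                low = mid + 1;  // continue in the right half
            }
        }
        return -1;
    }
}
```

<p>The iterative form avoids recursion but halves the search range on every step in the same way as the C function.</p>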
<br />
<img src ="http://www.blogjava.net/persister/aggbug/281201.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2009-06-10 18:08 <a href="http://www.blogjava.net/persister/archive/2009/06/10/281201.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Learning Lucene: indexing</title><link>http://www.blogjava.net/persister/archive/2009/06/09/281032.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Tue, 09 Jun 2009 15:33:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2009/06/09/281032.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/281032.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2009/06/09/281032.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/281032.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/281032.html</trackback:ping><description><![CDATA[<p>1. Adding documents to an index:<br />
&nbsp;protected String[] keywords = {"1", "2"};<br />
&nbsp;protected String[] unindexed = {"Netherlands", "Italy"};<br />
&nbsp;protected String[] unstored = {"Amsterdam has lots of bridges", "Venice has lots of canals"};<br />
&nbsp;protected String[] text = {"Amsterdam", "Venice"};<br />
&nbsp;Directory dir = FSDirectory.getDirectory(indexDir, true);<br />
&nbsp;IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);<br />
&nbsp;writer.setUseCompoundFile(true);<br />
&nbsp;for (int i = 0; i &lt; keywords.length; i++) {<br />
&nbsp;&nbsp;Document doc = new Document();<br />
&nbsp;&nbsp;doc.add(Field.Keyword("id", keywords[i]));<br />
&nbsp;&nbsp;doc.add(Field.UnIndexed("country", unindexed[i]));<br />
&nbsp;&nbsp;doc.add(Field.UnStored("contents", unstored[i]));<br />
&nbsp;&nbsp;doc.add(Field.Text("city", text[i]));<br />
&nbsp;&nbsp;writer.addDocument(doc);<br />
&nbsp;}<br />
&nbsp;writer.optimize();<br />
&nbsp;writer.close();<br />
2. Removing documents from an index:<br />
&nbsp;IndexReader reader = IndexReader.open(dir);<br />
&nbsp;reader.delete(1);<br />
&nbsp;reader.close();<br />
The call above deletes only one document at a time; the call below deletes every document that matches a term:<br />
&nbsp;IndexReader reader = IndexReader.open(dir);<br />
&nbsp;reader.delete(new Term("city", "Amsterdam"));<br />
&nbsp;reader.close();</p>
<p>3. Indexing dates<br />
&nbsp;Document doc = new Document();<br />
&nbsp;doc.add(Field.Keyword("indexDate", new Date()));</p>
<p>4. Tuning indexing performance<br />
&nbsp;IndexWriter&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; System property&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Default value&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Description<br />
&nbsp;--------------------------------------------------------------------------------------------------<br />
&nbsp;mergeFactor&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;org.apache.lucene.mergeFactor&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Controls segment merge&nbsp; frequency and size<br />
&nbsp;maxMergeDocs&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;org.apache.lucene.maxMergeDocs&nbsp;&nbsp;&nbsp;Integer.MAX_VALUE&nbsp;&nbsp;&nbsp; Limits the number of&nbsp; documents per segment<br />
&nbsp;minMergeDocs &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;org.apache.lucene.minMergeDocs&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Controls the amount of&nbsp;&nbsp; RAM used when indexing</p>
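<p>mergeFactor controls how many documents are buffered in memory before being written to disk, and also how often index segments are merged. Its default is 10: once 10<br />
documents are buffered they must be written to disk, and whenever the number of segments reaches a power of 10 they are merged into a single segment. maxMergeDocs caps<br />
the number of documents any one segment may hold. A larger mergeFactor makes better use of RAM and speeds up indexing, but it also means that<br />
merges happen less often, which can leave a large number of unmerged segments. Searching then has to open more segment files,<br />
so search becomes slower. minMergeDocs is another IndexWriter instance variable that affects indexing performance. Its <br />
value controls how many Documents have to be buffered before they&#8217;re merged to a segment. In other words, minMergeDocs also<br />
controls the number of buffered documents, as mergeFactor does.</p>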
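<p>5. RAMDirectory helps make use of RAM. Clustering or multithreading can also be used to exploit hardware and software resources fully and speed up indexing.</p>
<p>6. Sometimes you want to limit the size of each field, for example indexing only the first 1000 terms. maxFieldLength controls this.</p>
<p>7. IndexWriter&#8217;s optimize() method merges segments, lowering the segment count and so reducing the time spent reading the index during search.</p>
<p>8. Take care in multithreaded environments: an index-modifying IndexReader operation can&#8217;t be executed <br />
while an index-modifying IndexWriter operation is in progress. To prevent misuse, Lucene locks the<br />
index while certain APIs are in use.</p>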
<img src ="http://www.blogjava.net/persister/aggbug/281032.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2009-06-09 23:33 <a href="http://www.blogjava.net/persister/archive/2009/06/09/281032.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Lucene Query types</title><link>http://www.blogjava.net/persister/archive/2009/06/08/280567.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Mon, 08 Jun 2009 02:05:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2009/06/08/280567.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/280567.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2009/06/08/280567.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/280567.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/280567.html</trackback:ping><description><![CDATA[<p>A basic Lucene query:<br />
&nbsp;Searcher searcher = new IndexSearcher(dbpath);<br />
&nbsp;Query query = QueryParser.parse(searchkey, searchfield,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;new StandardAnalyzer());<br />
&nbsp;Hits hits = searcher.search(query);<br />
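Below are the various Query subclasses; each of them has a counterpart in QueryParser syntax.</p>
<p>1. TermQuery is the most commonly used. It looks up a single Term (the smallest unit of the index, consisting of a field name and a value) in the index.<br />
A Term corresponds directly to the key and field passed to QueryParser.parse.</p>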
<p>&nbsp;IndexSearcher searcher = new IndexSearcher(directory);<br />
&nbsp;Term t = new Term("isbn", "1930110995");<br />
&nbsp;Query query = new TermQuery(t);<br />
&nbsp;Hits hits = searcher.search(query);</p>
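<p>2. RangeQuery performs range searches. Its third argument says whether the range is inclusive (closed) or exclusive (open).<br />
The search is carried out as N queries covering the terms from begin to end.</p>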
<p>&nbsp;Term begin, end;<br />
&nbsp;Searcher searcher = new IndexSearcher(dbpath);<br />
&nbsp;begin = new Term("pubmonth","199801");<br />
&nbsp;end = new Term("pubmonth","199810");<br />
&nbsp;RangeQuery query = new RangeQuery(begin, end, true);<br />
</p>
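<p>RangeQuery is essentially a comparison, so the following query also works, although its meaning differs somewhat from the one above.<br />
The two terms define a range, and everything that compares inside it is found. The comparison rule matters here: strings are compared character by character starting from the first, so the result does not depend on string length.<br />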
begin = new Term("pubmonth","19");<br />
&nbsp;end = new Term("pubmonth","20");<br />
&nbsp;RangeQuery query = new RangeQuery(begin, end, true);<br />
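<br />
The comparison rule can be checked without Lucene. The sketch below assumes that range boundaries are compared the way String.compareTo compares strings; the class TermRangeSketch and its method are hypothetical names used only for illustration.<br />

```java
final class TermRangeSketch {
    // True when value lies between begin and end in String.compareTo (lexicographic) order.
    static boolean inRange(String value, String begin, String end, boolean inclusive) {
        int lower = value.compareTo(begin);
        int upper = value.compareTo(end);
        return inclusive
                ? (lower >= 0 && upper <= 0)
                : (lower > 0 && upper < 0);
    }
}
```

Under this rule "199805" falls inside the range from "19" to "20", while "200401" does not, because it sorts after "20". The term "100" also falls outside, because it sorts before "19" even though it is the longer string.<br />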
<br />
<br />
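3. PrefixQuery. A TermQuery finds a document only on an exact match (against a field created with Field.Keyword),<br />
which limits the flexibility of a search. A PrefixQuery only has to match the beginning of the value. For example, if the field is name and the<br />
records contain jackliu, jackwu, and jackli, then searching for jack finds all of them. QueryParser creates a PrefixQuery<br />
for a term when it ends with an asterisk (*) in query expressions.</p>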
<p>&nbsp;IndexSearcher searcher = new IndexSearcher(directory);<br />
&nbsp;Term term = new Term("category", "/technology/computers/programming");<br />
&nbsp;PrefixQuery query = new PrefixQuery(term);<br />
&nbsp;Hits hits = searcher.search(query);</p>
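<p>4. BooleanQuery. All the queries above work on a single field. To query several fields, use BooleanQuery,<br />
which combines multiple queries. Sub-queries are added with add(Query query, boolean required, boolean prohibited),<br />
and nesting BooleanQuery instances allows very complex combinations.<br />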
&nbsp;<br />
&nbsp;IndexSearcher searcher = new IndexSearcher(directory);<br />
&nbsp;TermQuery searchingBooks =<br />
&nbsp;new TermQuery(new Term("subject","search"));</p>
<p>&nbsp;RangeQuery currentBooks =<br />
&nbsp;new RangeQuery(new Term("pubmonth","200401"),<br />
&nbsp;&nbsp;new Term("pubmonth","200412"),true);<br />
&nbsp;&nbsp;<br />
&nbsp;BooleanQuery currentSearchingBooks = new BooleanQuery();<br />
&nbsp;currentSearchingBooks.add(searchingBooks, true, false);<br />
&nbsp;currentSearchingBooks.add(currentBooks, true, false);<br />
&nbsp;Hits hits = searcher.search(currentSearchingBooks);</p>
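<p>The add method of BooleanQuery takes two boolean arguments, required and prohibited:<br />
true, false: the clause being added must be satisfied;<br />
false, true: the clause being added must not be satisfied;<br />
false, false: the clause being added is optional;<br />
true, true: invalid.</p>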
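<p>The four flag combinations can be modelled in a few lines. This is a sketch, not Lucene code: the class BooleanClauseSketch, its Kind enum, and the matches method are hypothetical, and the rule that a query with no required clause needs at least one optional clause to hit is an assumption about the intended semantics.</p>

```java
final class BooleanClauseSketch {
    enum Kind { REQUIRED, PROHIBITED, OPTIONAL }

    // Maps the (required, prohibited) flag pair to a clause kind; (true, true) is rejected.
    static Kind classify(boolean required, boolean prohibited) {
        if (required && prohibited) {
            throw new IllegalArgumentException("a clause cannot be both required and prohibited");
        }
        if (required) return Kind.REQUIRED;
        if (prohibited) return Kind.PROHIBITED;
        return Kind.OPTIONAL;
    }

    // Decides whether one document matches, given whether each clause hit that document.
    static boolean matches(Kind[] kinds, boolean[] hits) {
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (int i = 0; i < kinds.length; i++) {
            switch (kinds[i]) {
                case REQUIRED:
                    anyRequired = true;
                    if (!hits[i]) return false; // a required clause missed
                    break;
                case PROHIBITED:
                    if (hits[i]) return false;  // a prohibited clause hit
                    break;
                default:
                    anyOptionalHit |= hits[i];
            }
        }
        // Assumption: without required clauses, at least one optional clause must hit.
        return anyRequired || anyOptionalHit;
    }
}
```

<p>In the currentSearchingBooks example both clauses are added with (true, false), so a document matches only when both the subject clause and the pubmonth range clause hit.</p>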
<p>QueryParser handily constructs BooleanQuerys when multiple terms are specified.<br />
Grouping is done with parentheses, and the prohibited and required flags are<br />
set when the &#8211;, +, AND, OR, and NOT operators are specified.</p>
<p>5. PhraseQuery allows more precise searches. It constrains the positions of two or more keywords in the indexed<br />
text, for example finding text that contains A and B with one word between them. Terms surrounded by double quotes in <br />
QueryParser parsed expressions are translated into a PhraseQuery.<br />
The slop factor defaults to zero, but you can adjust the slop factor <br />
by adding a tilde (~) followed by an integer, <br />
for example the expression "quick fox"~3.</p>
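<p>6. WildcardQuery. WildcardQuery gives finer control and more flexibility than PrefixQuery, and it is the easiest<br />
to understand and use.</p>
<p>7. FuzzyQuery. This query is unusual: it finds records that look similar to the search term. QueryParser <br />
supports FuzzyQuery by suffixing a term with a tilde (~), for example wuzza~.</p>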
<p>&nbsp;public void testFuzzy() throws Exception {<br />
&nbsp;&nbsp;indexSingleFieldDocs(new Field[] {<br />
&nbsp;&nbsp;Field.Text("contents", "fuzzy"),<br />
&nbsp;&nbsp;Field.Text("contents", "wuzzy")<br />
&nbsp;&nbsp;});<br />
&nbsp;&nbsp;IndexSearcher searcher = new IndexSearcher(directory);<br />
&nbsp;&nbsp;Query query = new FuzzyQuery(new Term("contents", "wuzza"));<br />
&nbsp;&nbsp;Hits hits = searcher.search(query);<br />
&nbsp;&nbsp;assertEquals("both close enough", 2, hits.length());<br />
&nbsp;&nbsp;assertTrue("wuzzy closer than fuzzy",<br />
&nbsp;&nbsp;hits.score(0) != hits.score(1));<br />
&nbsp;&nbsp;assertEquals("wuzza bear","wuzzy", hits.doc(0).get("contents"));<br />
&nbsp;}<br />
</p>
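<p>One common way to measure how similar two terms look is the Levenshtein edit distance, and the test above is consistent with it: wuzzy is one edit away from wuzza, fuzzy is two. The sketch below is a plain dynamic-programming implementation. The class EditDistanceSketch is a hypothetical name, and how FuzzyQuery turns a distance into a score is not shown.</p>

```java
final class EditDistanceSketch {
    // Minimum number of single-character insertions, deletions, and substitutions turning a into b.
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) previous[j] = j; // "" -> prefix of b
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // prefix of a -> ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = previous; // reuse the two rows instead of allocating a full matrix
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }
}
```

<p>Keeping only two rows of the table holds memory at O(length of b) while producing the same distance as the full matrix.</p>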
<img src ="http://www.blogjava.net/persister/aggbug/280567.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2009-06-08 10:05 <a href="http://www.blogjava.net/persister/archive/2009/06/08/280567.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Learning Lucene</title><link>http://www.blogjava.net/persister/archive/2009/03/06/258147.html</link><dc:creator>persister</dc:creator><author>persister</author><pubDate>Fri, 06 Mar 2009 03:03:00 GMT</pubDate><guid>http://www.blogjava.net/persister/archive/2009/03/06/258147.html</guid><wfw:comment>http://www.blogjava.net/persister/comments/258147.html</wfw:comment><comments>http://www.blogjava.net/persister/archive/2009/03/06/258147.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/persister/comments/commentRss/258147.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/persister/services/trackbacks/258147.html</trackback:ping><description><![CDATA[Today I pasted the programs from &#8220;Learning Lucene&#8221; into an Eclipse project and ran them.<br />
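Doing so deepened my understanding of search.<br />
Full-text search can be compared loosely with a database.<br />
Full-text search has no notion of a table and therefore no fixed set of fields, but it does have records, and each record is a Document object.<br />
Each document can have its own, different fields, as follows:<br />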
<br />
&nbsp;&nbsp;&nbsp;&nbsp;Document doc = new Document();&nbsp;<br />
<br />
&nbsp;&nbsp;&nbsp;doc.add(Field.Keyword("filename",file.getAbsolutePath()));&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;//Use only one of the next two lines: the first indexes without storing, the second indexes and stores <br />
&nbsp;&nbsp;&nbsp;//doc.add(Field.Text("content",new FileReader(file)));&nbsp; <br />
&nbsp;&nbsp;&nbsp;doc.add(Field.Text("content",this.chgFileToString(file))); <br />
&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;indexWriter.addDocument(doc); <br />
<br />
A query needs three important parameters.<br />
The first is the index path, that is, which index to search (comparable to the path of a database):<br />
<br />
Searcher searcher = new IndexSearcher(dbpath); <br />
<br />
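Next come the field to search and the keyword to look for: the field identifies the content,<br />
and the keyword is searched for within that content. This resembles an SQL statement, except that there is no notion of a table.<br />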
Query query <br />
&nbsp;&nbsp;&nbsp; = QueryParser.parse(searchkey,searchfield,new StandardAnalyzer());&nbsp;<br />
<br />
Then run the query; the result is a collection of documents:<br />
&nbsp;&nbsp;&nbsp;Hits hits = searcher.search(query);&nbsp;<br />
<br />
Process the resulting collection:<br />
<br />
&nbsp;&nbsp;&nbsp;if(hits != null)<br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; list = new ArrayList(); <br />
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;int temp_hitslength = hits.length(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Document doc = null; <br />
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp; for(int i = 0;i &lt; temp_hitslength; i++){ <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; doc = hits.doc(i); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //list.add(doc.get("filename")); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; list.add(doc.get("content"));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;}&nbsp;<br />
<br />
&nbsp; Appendix: common Field methods<span style="font-size: 10pt; color: black; font-family: 宋体;"><br />
<br />
The commonly used </span><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Field</span><span style="font-size: 10pt; color: black; font-family: 宋体;"> methods are as follows:</span><br />
<br />
<span style="font-size: 10pt; color: black; font-family: 'Courier New';"><br />
<table border="1" cellpadding="0">
    <tbody>
        <tr>
            <td style="padding: 0.75pt; width: 159.05pt;" width="212">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Method</span></p>
            </td>
            <td style="padding: 0.75pt; width: 56.2pt;" width="75">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Tokenized</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Indexed</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Stored</span></p>
            </td>
            <td style="padding: 0.75pt; width: 92.05pt;" width="123">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Purpose</span></p>
            </td>
        </tr>
        <tr>
            <td style="padding: 0.75pt; width: 159.05pt;" width="212">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Field.Text(String name, String value)</span></p>
            </td>
            <td style="padding: 0.75pt; width: 56.2pt;" width="75">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 92.05pt;" valign="top" width="123">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Tokenized, indexed, and stored; for example title and content fields</span></p>
            </td>
        </tr>
        <tr>
            <td style="padding: 0.75pt; width: 159.05pt;" width="212">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Field.Text(String name, Reader value)</span></p>
            </td>
            <td style="padding: 0.75pt; width: 56.2pt;" width="75">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">No</span></p>
            </td>
            <td style="padding: 0.75pt; width: 92.05pt;" valign="top" width="123">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Tokenized and indexed but not stored; for example </span><span style="font-size: 10pt; color: black; font-family: 'Courier New';">META</span><span style="font-size: 10pt; color: black; font-family: 宋体;"> information,</span></p>
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">which is not returned for display but must be searchable</span></p>
            </td>
        </tr>
        <tr>
            <td style="padding: 0.75pt; width: 159.05pt;" width="212">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Field.Keyword(String name, String value)</span></p>
            </td>
            <td style="padding: 0.75pt; width: 56.2pt;" width="75">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">No</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 92.05pt;" valign="top" width="123">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Indexed without tokenizing, and stored; for example date fields</span></p>
            </td>
        </tr>
        <tr>
            <td style="padding: 0.75pt; width: 159.05pt;" width="212">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Field.UnIndexed(String name, String value)</span></p>
            </td>
            <td style="padding: 0.75pt; width: 56.2pt;" width="75">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">No</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">No</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 92.05pt;" valign="top" width="123">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Not indexed, only stored; for example file paths</span></p>
            </td>
        </tr>
        <tr>
            <td style="padding: 0.75pt; width: 159.05pt;" width="212">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Field.UnStored(String name, String value)</span></p>
            </td>
            <td style="padding: 0.75pt; width: 56.2pt;" width="75">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">Yes</span></p>
            </td>
            <td style="padding: 0.75pt; width: 52.5pt;" width="70">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 'Courier New';">No</span></p>
            </td>
            <td style="padding: 0.75pt; width: 92.05pt;" valign="top" width="123">
            <p style="margin: 12pt 0cm 6pt; text-indent: 24pt; line-height: 16.5pt; text-align: center;" align="center"><span style="font-size: 10pt; color: black; font-family: 宋体;">Full-text indexed only, not stored</span></p>
            </td>
        </tr>
    </tbody>
</table>
</span><span style="font-size: 10pt; color: black; font-family: 宋体;"><br />
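Tokenizing
means splitting the text into terms for indexing; as the table shows, every tokenized field is also indexed. Indexing makes the field searchable by query terms, and storing determines whether the content itself is kept.
Field.Keyword is not tokenized but is indexed, so the field can be queried, whereas a Field.UnIndexed field cannot. Because
Field.Keyword is not tokenized, a query built with new
Term(searchfield, searchkey) finds a document only when searchkey is exactly equal to the value argument;
Field.Text and Field.UnStored behave differently because their values are tokenized.</span><br />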
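<br />
The difference between a keyword field and a tokenized field can be imitated without Lucene. The sketch below is an illustration under assumptions: the class FieldMatchSketch and both methods are hypothetical, and the tokenizer is a simple split at non-alphanumeric characters rather than a real Analyzer.<br />

```java
import java.util.Locale;

final class FieldMatchSketch {
    // Keyword-style field: the whole value is one term, so only an exact match is found.
    static boolean keywordMatches(String storedValue, String searchKey) {
        return storedValue.equals(searchKey);
    }

    // Text-style field: the value is split into lowercase terms and any one of them may match.
    static boolean tokenizedMatches(String storedValue, String searchKey) {
        for (String token : storedValue.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(searchKey)) return true;
        }
        return false;
    }
}
```

For the value "Amsterdam has lots of bridges", the search key "bridges" matches the tokenized form but not the keyword form, which matches only the complete string.<br />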
<br />
<a href="http://www.lucene.com.cn/about.htm">Lucene中国</a> (Lucene China) is a very good site with a detailed analysis of Lucene's internal structure, and it is worth consulting.<br />
<br />
<br />
<img src ="http://www.blogjava.net/persister/aggbug/258147.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/persister/" target="_blank">persister</a> 2009-03-06 11:03 <a href="http://www.blogjava.net/persister/archive/2009/03/06/258147.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>