BlogJava - Java天空任我翱翔 - Posts by category: Lucene, Nutch, Hadoop

Hadoop Study Notes (Part 1)

persister — Fri, 12 Mar 2010 12:59:00 GMT

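Today I downloaded Hadoop, worked through the tutorial in its documentation, and then implemented a word count example by following this article:

用 Hadoop 进行分布式数据处理，第 1 部分: 入门 (Distributed data processing with Hadoop, Part 1: Getting started)

Some theory notes follow: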
The storage is provided by HDFS, and analysis by MapReduce.

MapReduce is a good fit for problems
that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
An RDBMS is good for point queries or updates, where the dataset has been indexed
to deliver low-latency retrieval and update times of a relatively small amount of
data.
MapReduce suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are continually updated.

MapReduce tries to colocate the data with the compute node, so data access is fast
since it is local. This feature, known as data locality, is at the heart of MapReduce and
is the reason for its good performance.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined
map function for each record in the split.

On the other hand, if splits are too small, then the overhead of managing the splits and
of map task creation begins to dominate the total job execution time. For most jobs, a
good split size tends to be the size of an HDFS block, 64 MB by default.

Reduce tasks don’t have the advantage of data locality—the input to a single reduce
task is normally the output from all mappers.

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output—the combiner function’s
output forms the input to the reduce function.

Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk
can be made to be significantly larger than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk transfer
rate.
A quick calculation shows that if the seek time is around 10ms, and the transfer rate is
100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB. The default is actually 64 MB, although many HDFS installations
use 128 MB blocks. This figure will continue to be revised upward as transfer
speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally
operate on one block at a time, so if you have too few tasks (fewer than nodes in the
cluster), your jobs will run slower than they could otherwise.
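In other words: with a large block, little time goes to seeking and most goes to transferring data. With a small block, seek time becomes comparable to transfer time, so the total is roughly twice the transfer time, which is a poor trade. Concretely, for 100 MB of data stored in one 100 MB block, the transfer takes 1 s (at 100 MB/s) plus a single 10 ms seek. With 1 MB blocks the transfer still takes 1 s, but the seeks add 10 ms × 100 = 1 s, for a total of 2 s. Is bigger always better, then? No: if blocks are too large, a file may not be spread across the cluster, so MapReduce cannot process it in parallel and the job may actually run slower.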
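The arithmetic in the note above can be checked with a small model. This is only a sketch: it assumes exactly one seek per block and reuses the 10 ms seek time and 100 MB/s transfer rate from the quoted passage. The class and method names are invented for illustration.

```java
// Illustrative model only: total read time = one seek per block + transfer time.
final class BlockSeekCost {
    static final double SEEK_MS = 10.0;             // seek time assumed in the quoted text
    static final double TRANSFER_MB_PER_S = 100.0;  // transfer rate assumed in the quoted text

    /** Time in milliseconds to read dataMb megabytes stored in blocks of blockMb megabytes. */
    static double readTimeMs(double dataMb, double blockMb) {
        long blocks = (long) Math.ceil(dataMb / blockMb);
        double seekMs = blocks * SEEK_MS;
        double transferMs = dataMb / TRANSFER_MB_PER_S * 1000.0;
        return seekMs + transferMs;
    }

    public static void main(String[] args) {
        for (double blockMb : new double[] {1, 64, 100}) {
            System.out.printf("block = %3.0f MB -> %.0f ms%n", blockMb, readTimeMs(100, blockMb));
        }
    }
}
```

Under these assumptions, reading 100 MB costs 2000 ms with 1 MB blocks and 1010 ms with a single 100 MB block, which matches the roughly 2 s versus 1 s comparison in the note.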

persister 2010-03-12 20:59

VInt (Variable-Length Integer) in Lucene's Storage Format

persister — Tue, 02 Feb 2010 03:08:00 GMT

A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on.

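Definition of the variable-length integer format: the high-order bit of each byte indicates whether more bytes follow, and the low-order seven bits are the payload appended to the result. For example, in 00000001 the high bit is 0, so the number occupies a single byte; the payload is the seven bits 0000001, giving the value 1. In 10000010 00000001 the high bit of the first byte is 1, meaning another byte follows, and the high bit of the second byte is 0, meaning the number ends there, so it occupies two bytes. Note how the value is assembled: the seven payload bits of the last byte are the most significant and the seven payload bits of the first byte are the least significant, so the number is 0000001 0000010, which is 130 in decimal.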

VInt Encoding Example

Value	First byte	Second byte	Third byte
0	00000000
1	00000001
2	00000010
...
127	01111111
128	10000000	00000001
129	10000001	00000001
130	10000010	00000001
...
16,383	11111111	01111111
16,384	10000000	10000000	00000001
16,385	10000001	10000000	00000001
...

This is how the Lucene source code writes and reads the format. OutputStream does the writing:

  /** Writes an int in a variable-length format.  Writes between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see InputStream#readVInt()
   */
  public final void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      writeByte((byte)((i & 0x7f) | 0x80));
      i >>>= 7;
    }
    writeByte((byte)i);
  }

InputStream does the reading:

  /** Reads an int stored in variable-length format.  Reads between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see OutputStream#writeVInt(int)
   */
  public final int readVInt() throws IOException {
    byte b = readByte();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = readByte();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

>>> is the unsigned right shift operator.
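To try the encoding without a Lucene index, the same loops can be run against an in-memory byte array. The sketch below is a standalone adaptation, not Lucene code: the class `VIntDemo` and its helpers `encode`, `decode`, and `bits` are invented names, and `ByteArrayOutputStream` stands in for Lucene's stream classes.

```java
import java.io.ByteArrayOutputStream;

final class VIntDemo {
    /** Encodes a non-negative int using the same loop as writeVInt. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write(value);                     // last byte, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a single value using the same loop as readVInt. */
    static int decode(byte[] bytes) {
        int pos = 0;
        byte b = bytes[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Renders bytes as space-separated 8-bit binary strings, first byte first. */
    static String bits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            String s = Integer.toBinaryString(b & 0xFF);
            sb.append("00000000".substring(s.length())).append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 130, 16383, 16384}) {
            byte[] encoded = encode(v);
            System.out.println(v + " -> " + bits(encoded) + " -> " + decode(encoded));
        }
    }
}
```

The bit patterns printed for 128, 130, and 16384 can be compared directly with the rows of the encoding table.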

persister 2010-02-02 11:08

First Attempt at Nutch

persister — Thu, 23 Jul 2009 07:43:00 GMT

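Environment: Nutch 0.9 + Fedora 5 + Tomcat 6 + JDK 6

Tomcat and the JDK are already installed.

2: nutch-0.9.tar.gz
        Extract the downloaded tar.gz package under /opt and rename the directory:
        # tar -xzf nutch-0.9.tar.gz -C /opt
        # mv /opt/nutch-0.9 /opt/nutch

       Check that the environment works by running /opt/nutch/bin/nutch; if it prints its command usage, everything is fine.

       Crawling: # cd /opt/nutch
                 # mkdir urls
                 # vi urls/nutch.txt    (enter http://www.aicent.net/)
                 # vi conf/crawl-urlfilter.txt    (add the following; the regular expression filters which URLs of the site are crawled)
                        # accept hosts in MY.DOMAIN.NAME
                        +^http://([a-z0-9]*\.)*aicent.net/
                 # vi conf/nutch-site.xml    (give your spider a name) and set:

<property>
    <name>http.agent.name</name>
    <value>test/unique</value>
</property>

Start crawling: # bin/nutch crawl urls -dir crawl -depth 5 -threads 10 >& crawl.log

Wait a while; how long depends on the size of the site and the configured crawl depth.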
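The rule added to conf/crawl-urlfilter.txt can be tried out in isolation. The sketch below assumes the filter applies the expression after the leading `+` with find-style matching; it uses only `java.util.regex`, not Nutch itself, and the class name `UrlFilterSketch` is invented.

```java
import java.util.regex.Pattern;

final class UrlFilterSketch {
    // The rule from crawl-urlfilter.txt without its leading '+' (which means "accept").
    static final Pattern ACCEPT = Pattern.compile("^http://([a-z0-9]*\\.)*aicent.net/");

    /** Returns true when the URL is matched by the accept rule above. */
    static boolean accepted(String url) {
        // find() rather than matches(): the rule only anchors the start of the URL
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.aicent.net/",
            "http://aicent.net/about.html",
            "http://www.example.com/",
            "www.aicent.net"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accepted(url));
        }
    }
}
```

A seed entry without the http:// scheme is not matched by this rule, so seed URLs are best written in full. The unescaped dots in aicent.net match any character; writing aicent\.net would be stricter.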

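3: apache-tomcat

If every search returns 0 results, a parameter needs to be changed, because the Nutch index path configured inside Tomcat is wrong.
# vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml

<property>
    <name>searcher.dir</name>
    <value>/opt/nutch/crawl</value> <!-- the directory that holds the crawled pages -->
    <description>My path to nutch's searcher dir.</description>
</property>

# /opt/tomcat/bin/startup.sh

OK, done.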

Problems encountered:

Run: sh ./bin/nutch crawl urls -dir crawl -depth 3 -threads 60 -topN 100 >&./logs/nutch_log.log

1. Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
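Some posts online blame the JDK version and say JDK 1.6 cannot be used, so I installed 1.5. The same problem remained, which was strange.
More searching on Google turned up another possible cause: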

Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

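Explanation: this is usually a configuration problem in crawl-urlfilter.txt. For example, the filter rule should be
+^http://www.ihooyo.com but was written as http://www.ihooyo.com, which causes the error above.

But my own configuration had no such problem.
Besides nutch_log.log, the logs directory also contains an automatically generated log file, hadoop.log.
It showed the following error: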

2009-07-22 22:20:55,501 INFO crawl.Crawl - crawl started in: crawl
2009-07-22 22:20:55,501 INFO crawl.Crawl - rootUrlDir = urls
2009-07-22 22:20:55,502 INFO crawl.Crawl - threads = 60
2009-07-22 22:20:55,502 INFO crawl.Crawl - depth = 3
2009-07-22 22:20:55,502 INFO crawl.Crawl - topN = 100
2009-07-22 22:20:55,603 INFO crawl.Injector - Injector: starting
2009-07-22 22:20:55,604 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-07-22 22:20:55,604 INFO crawl.Injector - Injector: urlDir: urls
2009-07-22 22:20:55,605 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-07-22 22:20:56,574 INFO plugin.PluginRepository - Plugins: looking in: /opt/nutch/plugins
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - Registered Plugins:
2009-07-22 22:20:56,720 INFO plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Basic Query Filter (query-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Site Query Filter (query-site)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         HTTP Framework (lib-http)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         JavaScript Parser (parse-js)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         URL Query Filter (query-url)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Registered Extension-Points:
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-07-22 22:20:56,786 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2009-07-22 22:20:56,829 WARN mapred.LocalJobRunner - job_2319eh
java.lang.RuntimeException: java.net.UnknownHostException: jackliu: jackliu
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:617)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:591)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:364)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:390)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.startPartition(MapTask.java:294)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:355)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$100(MapTask.java:231)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:180)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Caused by: java.net.UnknownHostException: jackliu: jackliu
        at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:614)
        ... 8 more

So the host configuration was wrong. The fix:
Add the following to your /etc/hosts file
127.0.0.1 jackliu

After that, the crawl ran successfully.

2. http://127.0.0.1:8080/nutch-0.9
Searching for "nutch" returned an error:
HTTP Status 500 -

type Exception report

message

description The server encountered an internal error () that prevented it from fulfilling this request.

exception

org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value
org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)
org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)
org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)
org.apache.jasper.compiler.Parser.parseQuoted(Parser.java:299)
org.apache.jasper.compiler.Parser.parseAttributeValue(Parser.java:249)
org.apache.jasper.compiler.Parser.parseAttribute(Parser.java:211)
org.apache.jasper.compiler.Parser.parseAttributes(Parser.java:154)
org.apache.jasper.compiler.Parser.parseInclude(Parser.java:867)
org.apache.jasper.compiler.Parser.parseStandardAction(Parser.java:1134)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1461)
org.apache.jasper.compiler.Parser.parse(Parser.java:137)
org.apache.jasper.compiler.ParserController.doParse(ParserController.java:255)
org.apache.jasper.compiler.ParserController.parse(ParserController.java:103)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:170)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:332)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:312)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:299)
org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:586)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:317)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)

note The full stack trace of the root cause is available in the Apache Tomcat/6.0.20 logs.

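Analysis: search.jsp in the root directory of the Nutch web application shows that this is a quote-matching problem.

<jsp:include page="<%= language + "/include/header.html"%>"/> //line 152 search.jsp

The first quotation mark is matched with the next quotation mark that appears, not with the last one on the line, hence the error.

Solution:

Change that line to:

<jsp:include page="<%= language + urlsuffix %>"/>

Here we define a string urlsuffix, placed right after the definition of the language string:

String language =   // line 116 search.jsp
    ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale())
    .getLocale().getLanguage();
String urlsuffix="/include/header.html";

After the change, restart Tomcat to make sure it takes effect; searching no longer reports the error.

3. No search results?
Comparing nutch_log.log with descriptions found online showed differences, and the crawl directory contained only two folders, segments and crawldb. I crawled again
and this time it worked. It is unclear why the previous crawl failed.

4. cached.jsp, explain.jsp, and others have the same quoting error as in item 2; fixing them the same way resolves it.

5. It took a whole morning and half an afternoon today, but Nutch is finally installed and configured. More study tomorrow.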

persister 2009-07-23 15:43

PhraseQuery, SpanQuery, and PhrasePrefixQuery

persister — Tue, 14 Jul 2009 01:49:00 GMT

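PhraseQuery uses position information to run proximity queries. Term queries for "我们" ("we") and "祖国" ("motherland"), for example, return every document that contains both terms. Sometimes, though, we only want documents in which the two terms are separated by at most one or two characters rather than being arbitrarily far apart; that is what PhraseQuery is for. Consider the following: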
    doc.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));
那么：
    String[] phrase = new String[] {"quick", "fox"};
    assertFalse("exact phrase not found", matched(phrase, 0));
    assertTrue("close enough", matched(phrase, 1));
multi-terms:
    assertFalse("not close enough", matched(new String[] {"quick", "jumped", "lazy"}, 3));
    assertTrue("just enough", matched(new String[] {"quick", "jumped", "lazy"}, 4));
    assertFalse("almost but not quite", matched(new String[] {"lazy", "jumped", "quick"}, 7));
    assertTrue("bingo", matched(new String[] {"lazy", "jumped", "quick"}, 8));

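The number is the slop, set as shown below. It is the maximum number of position moves allowed to bring the query terms, taken in query order, into alignment with the document.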
    query.setSlop(slop);

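Order matters:
    String[] phrase = new String[] {"fox", "quick"};
    assertFalse("hop flop", matched(phrase, 2));
    assertTrue("hop hop slop", matched(phrase, 3));

The principle is as follows:

For the query terms quick and fox, fox only needs to move one position to match quick brown fox. For the terms fox and quick,
fox has to move three positions. The larger the distance moved, the lower the document's score, and the less likely it is to appear in the results.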

SpanQuery uses position information for more interesting queries:

SpanQuery type	Description
SpanTermQuery	Used in conjunction with the other span query types. On its own, it's functionally equivalent to TermQuery.
SpanFirstQuery	Matches spans that occur within the first part of a field.
SpanNearQuery	Matches spans that occur near one another.
SpanNotQuery	Matches spans that don't overlap one another.
SpanOrQuery	Aggregates matches of span queries.

SpanFirstQuery: To query for spans that occur within the first n positions of a field, use SpanFirstQuery.

quick = new SpanTermQuery(new Term("f", "quick"));
brown = new SpanTermQuery(new Term("f", "brown"));
red = new SpanTermQuery(new Term("f", "red"));
fox = new SpanTermQuery(new Term("f", "fox"));
lazy = new SpanTermQuery(new Term("f", "lazy"));
sleepy = new SpanTermQuery(new Term("f", "sleepy"));
dog = new SpanTermQuery(new Term("f", "dog"));
cat = new SpanTermQuery(new Term("f", "cat"));

SpanFirstQuery sfq = new SpanFirstQuery(brown, 2);
assertNoMatches(sfq);
sfq = new SpanFirstQuery(brown, 3);
assertOnlyBrownFox(sfq);

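SpanNearQuery:

Spans that are near one another

First, a word about PhraseQuery: it is not a span query class, but it can perform proximity matching.

The terms in a matching document are usually adjacent. To allow for intervening terms in the document, or to match terms in reverse order, PhraseQuery has a slop factor. The slop is the maximum allowed distance between the terms, but not distance in the usual sense: it is the number of position moves needed to arrange the terms into the given phrase. This means PhraseQuery always computes the distance relative to the order of the terms in the query. With quick brown fox as the document, the two terms quick fox have a slop of 1 (fox moves once), whereas fox quick needs three moves, so its slop is 3.

Next, SpanNearQuery, a subclass of SpanQuery.

It also matches by proximity, but the terms need not appear in the document in query order: a separate flag says whether the match must be in order or may also be in reverse order. The distance is also not a number of moves; it is measured from the start of the first span to the end of the last span.

Using SpanTermQuery objects as the SpanQuery clauses of a SpanNearQuery gives an effect very similar to PhraseQuery. The third argument of the SpanNearQuery constructor is the inOrder flag: when it is true, the spans must appear in the document in the same order as in the query; when it is false, either order matches.

For example, with the document the quick brown fox jumps over the lazy dog: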

public void testSpanNearQuery() throws Exception {
    SpanQuery[] quick_brown_dog = new SpanQuery[]{quick, brown, dog};

    // in order, slop 0, three terms: no match
    SpanNearQuery snq = new SpanNearQuery(quick_brown_dog, 0, true);
    assertNoMatches(snq);

    // in order, slop 4: still no match
    snq = new SpanNearQuery(quick_brown_dog, 4, true);
    assertNoMatches(snq);

    // in order, slop 5: matches
    snq = new SpanNearQuery(quick_brown_dog, 5, true);
    assertOnlyBrownFox(snq);

    // any order, slop 3, two terms (lazy, fox): matches
    snq = new SpanNearQuery(new SpanQuery[]{lazy, fox}, 3, false);
    assertOnlyBrownFox(snq);

    // The same search with PhraseQuery: it works in query order,
    // so lazy and fox need a slop of 5
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("f", "lazy"));
    pq.add(new Term("f", "fox"));
    pq.setSlop(4);
    assertNoMatches(pq);    // slop 4 does not match

    pq.setSlop(5);          // PhraseQuery with slop 5
    assertOnlyBrownFox(pq);
}

3. PhrasePrefixQuery is mainly used for synonym queries:
    IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
    Document doc1 = new Document();
    doc1.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));
    writer.addDocument(doc1);
    Document doc2 = new Document();
    doc2.add(Field.Text("field","the fast fox hopped over the hound"));
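    writer.addDocument(doc2);
    writer.close();

    // the searcher is opened on the same directory once the writer is closed
    IndexSearcher searcher = new IndexSearcher(directory);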
    PhrasePrefixQuery query = new PhrasePrefixQuery();
    query.add(new Term[] {new Term("field", "quick"), new Term("field", "fast")});
    query.add(new Term("field", "fox"));

    Hits hits = searcher.search(query);
    assertEquals("fast fox match", 1, hits.length());
    query.setSlop(1);
    hits = searcher.search(query);
    assertEquals("both match", 2, hits.length());

persister 2009-07-14 09:49

Some Considerations for Query Keywords Entered into a Search Engine

persister — Sat, 11 Jul 2009 09:33:00 GMT

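1. First, misspellings. How do we detect that an input word is misspelled, or return the right results even when it is? One approach with Lucene is phonetic matching using the Metaphone algorithm.

2. Synonym queries. Both SynonymAnalyzer and PhrasePrefixQuery can solve this problem.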

persister 2009-07-11 17:33

Analyzer

persister — Tue, 07 Jul 2009 07:59:00 GMT

Primary analyzers available in Lucene
Analyzer	Steps taken
WhitespaceAnalyzer	Splits tokens at whitespace
SimpleAnalyzer	Divides text at nonletter characters and lowercases
StopAnalyzer	Divides text at nonletter characters, lowercases, and removes stop words
StandardAnalyzer	Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words
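The difference between the first three analyzers is easiest to see on one input string. The sketch below imitates the "steps taken" column in plain Java; it is not Lucene's implementation, the class `AnalyzerSketch` and its methods are invented, and the stop list is a tiny hypothetical one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalyzerSketch {
    // Tiny hypothetical stop list for the demo; not Lucene's actual list.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "in"));

    /** Like WhitespaceAnalyzer: split at whitespace, change nothing else. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Like SimpleAnalyzer: split at non-letter characters and lowercase. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Like StopAnalyzer: the simple tokens minus stop words. */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick-brown Fox's den";
        System.out.println("whitespace: " + whitespace(text));
        System.out.println("simple:     " + simple(text));
        System.out.println("stop:       " + stop(text));
    }
}
```

StandardAnalyzer is left out on purpose: its grammar-based tokenizer is too involved to imitate in a few lines.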

persister 2009-07-07 15:59

Porter stemming algorithm

persister — Mon, 06 Jul 2009 14:47:00 GMT

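PorterStemFilter
Stemming reduces a word to its root form. In languages such as English, a word takes many inflected forms, for example with -ed, -ing, or -ly added. If tokenization can recover the stem of these variants, search results improve considerably. There are many stemming algorithms; the three mainstream ones are the Porter stemming algorithm, the Lovins stemming algorithm, and the Lancaster (Paice/Husk) stemming algorithm, along with various improved or alternative algorithms. The PorterStemmer called by PorterStemFilter is an implementation of the Porter stemming algorithm.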

persister 2009-07-06 22:47

How Lucene's Inverted Index Works

persister — Wed, 10 Jun 2009 10:08:00 GMT

Reposted from: http://blog.donews.com/windshow/archive/2005/11/24/638234.aspx

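Inverted index

Lucene is a high-performance full-text search toolkit for Java that uses an inverted file index structure. The structure and the algorithm that builds it are as follows:

0) Suppose there are two articles, 1 and 2
Article 1: Tom lives in Guangzhou,I live in Guangzhou too.
Article 2: He once lived in Shanghai.

1) Because Lucene indexes and searches by keyword, we first need to extract the keywords of the two articles, which usually involves the following steps:
a. What we have is the article content, that is, a string. We first find all the words in it (tokenization). English words are separated by spaces and are easy to handle; Chinese words run together and need special segmentation.
b. Words such as "in", "once", and "too" carry little meaning, as do Chinese characters such as "的" and "是". Words that do not represent a concept can be filtered out.
c. Users usually expect a search for "He" to also find articles containing "he" and "HE", so all words must be converted to the same case.
d. Users usually expect a search for "live" to also find articles containing "lives" and "lived", so "lives" and "lived" must be reduced to "live".
e. Punctuation usually does not represent a concept and can be filtered out as well.
In Lucene these steps are performed by the Analyzer class.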

After this processing:
    Keywords of article 1: [tom] [live] [guangzhou] [i] [live] [guangzhou]
    Keywords of article 2: [he] [live] [shanghai]

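2) With the keywords in hand, we can build the inverted index. The mapping above goes from an article number to all keywords in that article. The inverted index reverses it: from a keyword to the numbers of all articles that contain it. After inversion, articles 1 and 2 become
Keyword	Article numbers
guangzhou	1
he	2
i	1
live	1,2
shanghai	2
tom	1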

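Knowing only which articles contain a keyword is usually not enough; we also need to know how often and where it occurs in each article. There are two kinds of position: a) character position, which records the character offset of the word in the article (advantage: fast positioning when highlighting keywords); b) keyword position, which records the word's ordinal among the article's keywords (advantages: a smaller index and fast phrase queries). Lucene records the latter.

With frequency and position added, the index structure becomes:
Keyword	Article number[frequency]	Positions
guangzhou	1[2]	3,6
he	2[1]	1
i	1[1]	4
live	1[2],2[1]	2,5,2
shanghai	2[1]	3
tom	1[1]	1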

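Take the live row as an example: live occurs twice in article 1 and once in article 2, and its positions are "2,5,2". What does that mean? Read it together with the article numbers and frequencies: article 1 has two occurrences, so "2,5" are the two positions of live in article 1; article 2 has one occurrence, so the remaining "2" means live is the 2nd keyword of article 2.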
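The table can be reproduced with a few lines of code. This is a minimal in-memory sketch, assuming the keywords have already been extracted as listed above; it says nothing about Lucene's on-disk layout, and the class `InvertedIndexSketch` is an invented name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    /**
     * Builds term -> (article number -> 1-based keyword positions).
     * TreeMap keeps the terms, and the article numbers, in sorted order.
     */
    static Map<String, Map<Integer, List<Integer>>> build(String[][] docs) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (int d = 0; d < docs.length; d++) {
            for (int p = 0; p < docs[d].length; p++) {
                index.computeIfAbsent(docs[d][p], term -> new TreeMap<>())
                     .computeIfAbsent(d + 1, doc -> new ArrayList<>())
                     .add(p + 1);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"tom", "live", "guangzhou", "i", "live", "guangzhou"}, // article 1
            {"he", "live", "shanghai"}                              // article 2
        };
        for (Map.Entry<String, Map<Integer, List<Integer>>> e : build(docs).entrySet()) {
            StringBuilder line = new StringBuilder(e.getKey());
            for (Map.Entry<Integer, List<Integer>> posting : e.getValue().entrySet()) {
                // article[frequency] followed by the positions in that article
                line.append("  ").append(posting.getKey())
                    .append('[').append(posting.getValue().size()).append("] ")
                    .append(posting.getValue());
            }
            System.out.println(line);
        }
    }
}
```

The frequency is simply the length of each position list, which is why the position column can be split back per article using the frequencies.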

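This is the core of the Lucene index structure. Note that the keywords are stored in character order (Lucene does not use a B-tree), so Lucene can locate a keyword quickly with binary search.

In the implementation, Lucene stores the three columns above as the term dictionary file (Term Dictionary), the frequency file (frequencies), and the position file (positions). The dictionary file holds each keyword plus pointers into the frequency and position files, through which the keyword's frequency and position information can be found.

    Lucene uses the concept of a field to express where information is located (in the title, the body, the URL, and so on). When the index is built, the field is also recorded in the dictionary file, and every keyword carries field information (because every keyword belongs to one or more fields).

    To reduce the size of the index files, Lucene also compresses the index. First, keywords in the dictionary file are compressed to <prefix length, suffix>; for example, if the current term is "阿拉伯语" ("Arabic language") and the previous term is "阿拉伯" ("Arabia"), then "阿拉伯语" is stored as <3, 语>. Second, numbers are compressed heavily: only the difference from the previous value is stored, which shortens the number and so reduces the bytes needed to store it. For example, if the current article number is 16389 (3 bytes uncompressed) and the previous one is 16382, the stored value is 7 (a single byte).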
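Both compression ideas fit in a few lines. The sketch below is illustrative only and does not reproduce Lucene's file format: `frontCode` and `toGaps` are invented helper names, the `<length,suffix>` string is just a readable rendering, and an ASCII word pair stands in for the Chinese example.

```java
final class IndexCompressionSketch {
    /** Length of the prefix that term shares with the previous term. */
    static int sharedPrefix(String previous, String term) {
        int n = Math.min(previous.length(), term.length());
        int i = 0;
        while (i < n && previous.charAt(i) == term.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Renders term as "<prefixLength,suffix>" relative to the previous term. */
    static String frontCode(String previous, String term) {
        int k = sharedPrefix(previous, term);
        return "<" + k + "," + term.substring(k) + ">";
    }

    /** Replaces each ascending article number by its gap from the one before it. */
    static int[] toGaps(int[] sortedDocIds) {
        int[] gaps = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return gaps;
    }

    public static void main(String[] args) {
        System.out.println(frontCode("arab", "arabic"));
        System.out.println(java.util.Arrays.toString(toGaps(new int[] {16382, 16389})));
    }
}
```

Small gaps pay off because they are then written in a variable-length integer format, where values below 128 take one byte.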

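    The following query against this index explains why the index is worth building.
Suppose we search for the word "live". Lucene binary-searches the dictionary, finds the term, follows the pointer into the frequency file to read all article numbers, and returns the result. The dictionary is usually very small, so the whole process takes milliseconds.
A plain sequential scan without an index, string-matching the content of every article, would be very slow, and with a large number of articles the time is often unacceptable.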

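My own notes:
See also http://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95

Binary search
Finds a given element in a sorted array.
First compare against the middle element of the array; if they are equal, return a pointer to that element, meaning it was found. Otherwise, the function continues searching the half in which the value could lie, and so on. If the remaining array has length 0, the element is not present and the function ends.
The function is:

#include <stddef.h>  /* NULL */

int *binarySearch(int val, int array[], int n)
{
    int m = n/2;                 /* index of the middle element */
    if(n <= 0) return NULL;      /* empty range: not found */
    if(val == array[m]) return array + m;
    if(val < array[m]) return binarySearch(val, array, m);   /* left half */
    else return binarySearch(val, array+m+1, n-m-1);         /* right half */
}

For an array of n elements, binary search performs at most 1+log2(n) comparisons. With one million elements that is about 20 comparisons, that is, at most 20 recursive calls to binarySearch().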
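The bound of about 20 steps for one million elements can be observed directly. The sketch below is an iterative Java variant rather than a translation of the C function; the `steps` counter and the class name `BinarySearchSteps` are additions for illustration.

```java
final class BinarySearchSteps {
    static int steps; // loop iterations used by the most recent search

    /** Returns the index of value in the sorted array, or -1 if it is absent. */
    static int search(int[] sorted, int value) {
        steps = 0;
        int lo = 0;
        int hi = sorted.length - 1;
        while (lo <= hi) {
            steps++;
            int mid = lo + (hi - lo) / 2; // avoids overflow of (lo + hi)
            if (sorted[mid] == value) {
                return mid;
            }
            if (sorted[mid] < value) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = 2 * i; // even numbers only, so odd values are never found
        }
        System.out.println("index of 246912: " + search(data, 246912) + ", steps: " + steps);
        System.out.println("index of 7:      " + search(data, 7) + ", steps: " + steps);
    }
}
```

Each iteration at least halves the remaining range, so no search over one million elements needs more than 20 iterations.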

persister 2009-06-10 18:08

Learning Lucene: Indexing

persister — Tue, 09 Jun 2009 15:33:00 GMT

1. Adding documents to an index:
protected String[] keywords = {"1", "2"};
protected String[] unindexed = {"Netherlands", "Italy"};
protected String[] unstored = {"Amsterdam has lots of bridges", "Venice has lots of canals"};
protected String[] text = {"Amsterdam", "Venice"};
Directory dir = FSDirectory.getDirectory(indexDir, true);
IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
writer.setUseCompoundFile(true);
for (int i = 0; i < keywords.length; i++) {
  Document doc = new Document();
  doc.add(Field.Keyword("id", keywords[i]));
  doc.add(Field.UnIndexed("country", unindexed[i]));
  doc.add(Field.UnStored("contents", unstored[i]));
  doc.add(Field.Text("city", text[i]));
  writer.addDocument(doc);
}
writer.optimize();
writer.close();
2. Removing documents from an index:
IndexReader reader = IndexReader.open(dir);
reader.delete(1);
The call above deletes only one document at a time; the following deletes every document that matches a condition:
IndexReader reader = IndexReader.open(dir);
reader.delete(new Term("city", "Amsterdam"));
reader.close();

3.Index dates
Document doc = new Document();
doc.add(Field.Keyword("indexDate", new Date()));

4. Tuning indexing performance
IndexWriter variable	System property	Default value	Description
mergeFactor	org.apache.lucene.mergeFactor	10	Controls segment merge frequency and size
maxMergeDocs	org.apache.lucene.maxMergeDocs	Integer.MAX_VALUE	Limits the number of documents per segment
minMergeDocs	org.apache.lucene.minMergeDocs	10	Controls the amount of RAM used when indexing

mergeFactor controls how many documents are buffered in memory before being written to disk, and also how often index segments are merged. Its default is 10: once 10 documents are buffered they must be written to disk, and whenever the number of segments reaches a power of 10 they are merged into a single segment. maxMergeDocs caps the number of documents any one segment can hold. A larger mergeFactor makes better use of RAM and speeds up indexing, but it also means less frequent merges, which can leave a large number of segments; searches then have to open more segment files, which slows searching down. minMergeDocs is another IndexWriter instance variable that affects indexing performance. Its value controls how many Documents have to be buffered before they're merged to a segment. In other words, minMergeDocs also controls how many documents are buffered, as mergeFactor does.

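5. RAMDirectory helps make use of RAM; clustering or multithreading can also be used to exploit hardware and software resources fully and speed up indexing.

6. Sometimes the size of each field needs to be limited, for example indexing only the first 1000 terms; maxFieldLength controls this.

7. IndexWriter's optimize() method merges segments, reducing their number and so the time needed to read the index during a search.

8. Take care in multithreaded environments: an index-modifying IndexReader operation can't be executed
while an index-modifying IndexWriter operation is in progress. To prevent misuse, Lucene locks the index
when certain APIs are used.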

persister 2009-06-09 23:33

Lucene Queries

persister — Mon, 08 Jun 2009 02:05:00 GMT

Basic Lucene query code:
Searcher searcher = new IndexSearcher(dbpath);
Query query = QueryParser.parse(searchkey, searchfield,
new StandardAnalyzer());
Hits hits = searcher.search(query);
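Below are the various Query subclasses; each of them corresponds to a QueryParser syntax.

1. TermQuery is the most commonly used. It searches the index for a single Term (the smallest unit of the index, consisting of a field name and a value).
A Term corresponds directly to the key and field passed to QueryParser.parse.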

IndexSearcher searcher = new IndexSearcher(directory);
Term t = new Term("isbn", "1930110995");
Query query = new TermQuery(t);
Hits hits = searcher.search(query);

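2. RangeQuery is for range queries. Its third argument says whether the range is inclusive (closed) or exclusive (open).
QueryParser builds N queries, one for each term between begin and end.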

Term begin, end;
Searcher searcher = new IndexSearcher(dbpath);
begin = new Term("pubmonth","199801");
end = new Term("pubmonth","199810");
RangeQuery query = new RangeQuery(begin, end, true);

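RangeQuery is essentially a comparison. The following query is therefore also valid, though its meaning differs from the one above: it sets a range by comparison,
and everything inside the range is found. The comparison rule matters here: strings are compared character by character starting from the first character, so string length plays no role.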
begin = new Term("pubmonth","19");
end = new Term("pubmonth","20");
RangeQuery query = new RangeQuery(begin, end, true);
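The comparison rule is ordinary lexicographic string comparison, which can be tried without Lucene. The sketch below assumes terms are ordered the way `String.compareTo` orders them; the helper `inRange` and the class `TermRangeSketch` are invented for illustration and are not part of the RangeQuery API.

```java
final class TermRangeSketch {
    /** Returns true when term lies between begin and end in lexicographic order. */
    static boolean inRange(String term, String begin, String end, boolean inclusive) {
        int lower = term.compareTo(begin); // >= 0 means term is at or after begin
        int upper = term.compareTo(end);   // <= 0 means term is at or before end
        return inclusive ? (lower >= 0 && upper <= 0) : (lower > 0 && upper < 0);
    }

    public static void main(String[] args) {
        // Range 199801..199810, as in the first example
        System.out.println(inRange("199805", "199801", "199810", true));
        // Range 19..20: every pubmonth starting with "19" falls inside
        System.out.println(inRange("199805", "19", "20", true));
        System.out.println(inRange("200401", "19", "20", true));
    }
}
```

With the range 19 to 20, a value such as 199805 is inside because it sorts after "19" and before "20", while 200401 sorts after "20" and is outside.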

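3. PrefixQuery. A TermQuery must match the whole term exactly (as with fields created by Field.Keyword) to return a result.
That limits the flexibility of searching; a PrefixQuery only needs to match the beginning of the value. For example, if the field is name and the records
include jackliu, jackwu, and jackli, searching for jack finds all of them. QueryParser creates a PrefixQuery
for a term when it ends with an asterisk (*) in query expressions.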

IndexSearcher searcher = new IndexSearcher(directory);
Term term = new Term("category", "/technology/computers/programming");
PrefixQuery query = new PrefixQuery(term);
Hits hits = searcher.search(query);

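4. BooleanQuery. All the queries above work on a single field. How do we query several fields? BooleanQuery
combines multiple queries, which are added with add(Query query, boolean required, boolean prohibited).
Nested BooleanQuery objects can express very complex queries.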

IndexSearcher searcher = new IndexSearcher(directory);
TermQuery searchingBooks =
new TermQuery(new Term("subject","search"));

RangeQuery currentBooks =
new RangeQuery(new Term("pubmonth","200401"),
new Term("pubmonth","200412"),true);

BooleanQuery currentSearchingBooks = new BooleanQuery();
currentSearchingBooks.add(searchingBooks, true, false);
currentSearchingBooks.add(currentBooks, true, false);
Hits hits = searcher.search(currentSearchingBooks);

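The add method of BooleanQuery takes two boolean parameters (required, prohibited):
true, false: the clause must be satisfied;
false, true: the clause must not be satisfied;
false, false: the clause is optional;
true, true: an invalid combination.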

QueryParser handily constructs BooleanQuerys when multiple terms are specified.
Grouping is done with parentheses, and the prohibited and required flags are
set when the –, +, AND, OR, and NOT operators are specified.

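5. PhraseQuery allows more precise matching. It can constrain the positions of two or more keywords in the indexed text,
for example finding documents that contain A and B with one more word between them. Terms surrounded by double quotes in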
QueryParser parsed expressions are translated into a PhraseQuery.
The slop factor defaults to zero, but you can adjust the slop factor
by adding a tilde (~) followed by an integer.
For example, the expression "quick fox"~3

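6. WildcardQuery. WildcardQuery offers finer control and more flexibility than PrefixQuery, and it is the easiest
to understand and use.

7. FuzzyQuery. This query is special: it finds records whose terms look similar to the search term. QueryParser
supports FuzzyQuery by suffixing a term with a tilde (~), for example wuzza~.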

public void testFuzzy() throws Exception {
  indexSingleFieldDocs(new Field[] {
  Field.Text("contents", "fuzzy"),
  Field.Text("contents", "wuzzy")
  });
  IndexSearcher searcher = new IndexSearcher(directory);
  Query query = new FuzzyQuery(new Term("contents", "wuzza"));
  Hits hits = searcher.search(query);
  assertEquals("both close enough", 2, hits.length());
  assertTrue("wuzzy closer than fuzzy",
  hits.score(0) != hits.score(1));
  assertEquals("wuzza bear","wuzzy", hits.doc(0).get("contents"));
}
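The notion of terms that "look similar" is commonly measured as edit distance, and FuzzyQuery is based on the Levenshtein distance. The sketch below computes that distance for the words in the test; it is a textbook dynamic-programming version, not Lucene's code, and how Lucene turns the distance into a score is not shown. The class name `EditDistanceSketch` is invented.

```java
final class EditDistanceSketch {
    /** Levenshtein distance: minimum number of single-character inserts, deletes, or substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println("wuzza vs wuzzy: " + distance("wuzza", "wuzzy"));
        System.out.println("wuzza vs fuzzy: " + distance("wuzza", "fuzzy"));
    }
}
```

wuzzy is one substitution away from wuzza and fuzzy is two, which is consistent with the test expecting wuzzy to rank first.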

persister 2009-06-08 10:05

Learning Lucene

persister — Fri, 06 Mar 2009 03:03:00 GMT

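Today I pasted the programs from the "Lucene学习" ("Learning Lucene") article into an Eclipse project and ran them,
which deepened my understanding of search.
Full-text search can be loosely compared with a database:
full-text search has no concept of a table, and so no fixed set of fields, but it does have records; each record is a Document object.
Each document can have its own, different fields, as shown below: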

    Document doc = new Document();

   doc.add(Field.Keyword("filename",file.getAbsolutePath()));

   //Use only one of the following two lines: the former indexes without storing, the latter indexes and stores
   //doc.add(Field.Text("content",new FileReader(file)));
   doc.add(Field.Text("content",this.chgFileToString(file)));

   indexWriter.addDocument(doc);

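A query needs three important parameters.
First, the index path, that is, which index to search (comparable to the path of a database):

Searcher searcher = new IndexSearcher(dbpath);

Then the field to search and the keyword to look for: the field gives access to its content,
and the keyword is searched within that content. This resembles an SQL statement, only without the concept of a table.
Query query
    = QueryParser.parse(searchkey,searchfield,new StandardAnalyzer());

Then run the query; the result is a collection of documents:
   Hits hits = searcher.search(query);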

Process the resulting collection:

   if (hits != null) {
       list = new ArrayList();
       int temp_hitslength = hits.length();
       Document doc = null;
       for (int i = 0; i < temp_hitslength; i++) {
           doc = hits.doc(i);
           //list.add(doc.get("filename"));
           list.add(doc.get("content"));
       }
   }

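Appendix: commonly used Field factory methods

Method	Tokenized	Indexed	Stored	Use
Field.Text(String name, String value)	Yes	Yes	Yes	Tokenized, indexed, and stored; for example title and content fields
Field.Text(String name, Reader value)	Yes	Yes	No	Tokenized and indexed but not stored; for example META information that is not returned for display but must be searchable
Field.Keyword(String name, String value)	No	Yes	Yes	Indexed without tokenizing, and stored; for example date fields
Field.UnIndexed(String name, String value)	No	No	Yes	Not indexed, only stored; for example file paths
Field.UnStored(String name, String value)	Yes	Yes	No	Full-text indexed only, not stored

Tokenizing means splitting the text into terms for indexing; as the table shows, every tokenized field is also indexed. Indexing makes the field searchable by query terms. Storing determines whether the original content itself is kept. Field.Keyword is not tokenized but is indexed, so the field can be searched, whereas a Field.UnIndexed field cannot. Because Field.Keyword is not tokenized, a query built with new Term(searchfield, searchkey) only matches when searchkey is exactly equal to the value argument; Field.Text and Field.UnStored behave differently.

Lucene中国 (Lucene China) is a very good site with a detailed analysis of Lucene's internals, and is worth consulting.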

persister 2009-03-06 11:03
