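1. Implementing a simple search feature

This chapter covers only the basic Lucene search API, which involves the following classes:

Lucene basic search API:

Class	Purpose
IndexSearcher	Entry point for searching an index. All searches go through one of the overloaded search methods of an IndexSearcher instance.
Query (and subclasses)	Each subclass encapsulates the logic of a particular query type. A Query instance is passed to IndexSearcher's search method.
QueryParser	Parses a human-readable expression into a concrete Query instance.
Hits	Holds the search results. Returned by IndexSearcher's search method.

Let's look at a few examples from the book.

LiaTestCase.java is a base class that extends JUnit's TestCase; several of the examples below extend it.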

package lia.common;

import junit.framework.TestCase;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.search.Hits;
import org.apache.lucene.document.Document;

import java.io.IOException;
import java.util.Date;
import java.text.ParseException;
import java.text.SimpleDateFormat;

/**
 * LIA base class for test cases.
 */
public abstract class LiaTestCase extends TestCase {
  // The test index has already been built; its location comes from a system property.
  private String indexDir = System.getProperty("index.dir");
  protected Directory directory;

  protected void setUp() throws Exception {
    directory = FSDirectory.getDirectory(indexDir, false);
  }

  protected void tearDown() throws Exception {
    directory.close();
  }

  /**
   * For troubleshooting: prints the score and title of every hit.
   */
  protected final void dumpHits(Hits hits) throws IOException {
    if (hits.length() == 0) {
      System.out.println("No hits");
    }

    for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      System.out.println(hits.score(i) + ":" + doc.get("title"));
    }
  }

  /** Fails unless one of the hits has the given title. */
  protected final void assertHitsIncludeTitle(Hits hits, String title)
      throws IOException {
    for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      if (title.equals(doc.get("title"))) {
        assertTrue(true);
        return;
      }
    }

    fail("title '" + title + "' not found");
  }

  protected final Date parseDate(String s) throws ParseException {
    return new SimpleDateFormat("yyyy-MM-dd").parse(s);
  }
}

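I. Searching for a specific Term, and parsing user-entered expressions with QueryParser

To search for a specific term, use TermQuery; a single-term query is especially suited to looking up Keyword fields. Parsing an expression typed by the user fits the way users actually search, and that parsing is done by QueryParser. If the expression cannot be parsed, an exception is thrown, from which you can get a detailed error message and show the user an appropriate hint. Parsing also needs an Analyzer to analyze the user's input; the Terms that are produced, and hence the resulting Query instance, depend on which Analyzer is used.

Here is an example, BasicSearchingTest.java: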

package lia.searching;

import lia.common.LiaTestCase;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BasicSearchingTest extends LiaTestCase {

  public void testTerm() throws Exception {
    IndexSearcher searcher = new IndexSearcher(directory);
    Term t = new Term("subject", "ant");                // build a Term
    Query query = new TermQuery(t);
    Hits hits = searcher.search(query);                 // search
    assertEquals("JDwA", 1, hits.length());             // check the result

    t = new Term("subject", "junit");
    hits = searcher.search(new TermQuery(t));
    assertEquals(2, hits.length());

    searcher.close();
  }

  public void testKeyword() throws Exception {          // keyword search
    IndexSearcher searcher = new IndexSearcher(directory);
    Term t = new Term("isbn", "1930110995");            // keyword term
    Query query = new TermQuery(t);
    Hits hits = searcher.search(query);
    assertEquals("JUnit in Action", 1, hits.length());

    searcher.close();
  }

  public void testQueryParser() throws Exception {      // QueryParser
    IndexSearcher searcher = new IndexSearcher(directory);

    // Parse the search expression into a Query instance.
    Query query = QueryParser.parse("+JUNIT +ANT -MOCK",
                                    "contents",
                                    new SimpleAnalyzer());
    Hits hits = searcher.search(query);
    assertEquals(1, hits.length());
    Document d = hits.doc(0);
    assertEquals("Java Development with Ant", d.get("title"));

    // Parse a second expression, this time using OR.
    query = QueryParser.parse("mock OR junit",
                              "contents",
                              new SimpleAnalyzer());
    hits = searcher.search(query);
    assertEquals("JDwA and JIA", 2, hits.length());

    searcher.close();
  }
}

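2. Using IndexSearcher

Since IndexSearcher is so central, let's look at how to use it. There are two ways to construct one:

■ By Directory
■ By a file system path

Constructing it from a Directory is recommended, because the search code then does not depend on where the index is stored. In LiaTestCase.java above we created a Directory:

directory = FSDirectory.getDirectory(indexDir, false);

and use it to construct an IndexSearcher:

IndexSearcher searcher = new IndexSearcher(directory);

Then call one of the searcher's search methods (there are six overloads; check the Javadoc to see when each is appropriate) to get a Hits object, which holds the search results. Let's look at Hits next.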

I. Working with Hits

Hits has four methods:

Hits methods for efficiently accessing search results
Hits method	Return value
length()	Number of documents in the Hits collection
doc(n)	Document instance of the nth top-scoring document
id(n)	Document ID of the nth top-scoring document
score(n)	Normalized score (based on the score of the topmost document) of the nth top-scoring document, guaranteed to be greater than 0 and less than or equal to 1

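These methods give access to the search results. Hits also caches Documents to improve performance; by default it caches the first 100, which are assumed to be the ones most often accessed.

Note: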

The methods doc(n), id(n), and score(n) require documents to be loaded
from the index when they aren’t already cached. This leads us to recommend
only calling these methods for documents you truly need to display or access;
defer calling them until needed.

II. Paging through Hits

There are two ways to page through Hits:

■ Keep the original Hits and IndexSearcher instances available while the user is navigating the search results.
■ Requery each time the user navigates to a new page.

The second is recommended: it is simpler over a stateless protocol such as HTTP, as in a web search engine like Google.
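The requery approach can be sketched in plain Java, with a List standing in for the ranked results of a fresh search. PagingSketch and page are hypothetical names, not part of Lucene; with a real Hits object you would call doc(i) only for indexes inside the computed window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: returns one page of a ranked result list. */
final class PagingSketch {

    /**
     * @param ranked   all results of the re-run query, best first
     * @param pageNo   zero-based page number requested by the user
     * @param pageSize number of results per page (must be positive)
     */
    static <T> List<T> page(List<T> ranked, int pageNo, int pageSize) {
        if (pageNo < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNo >= 0 and pageSize > 0 required");
        }
        // Use long arithmetic so a large page number cannot overflow.
        long first = (long) pageNo * pageSize;
        int start = (int) Math.min(first, ranked.size());
        int end = (int) Math.min(first + pageSize, ranked.size());
        return new ArrayList<>(ranked.subList(start, end));
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("A", "B", "C", "D", "E");
        System.out.println(page(titles, 0, 2)); // [A, B]
        System.out.println(page(titles, 2, 2)); // [E]
        System.out.println(page(titles, 5, 2)); // []
    }
}
```

Only the page number and page size have to travel with each request, which is what makes this approach a good fit for a stateless protocol.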

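III. Reading an index into memory

Sometimes, to make full use of system resources and improve performance, you can load the index into memory and search it there, for example: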

RAMDirectory ramDir = new RAMDirectory(dir);

This constructor has several overloads that build a RAMDirectory from different sources; see the Javadoc.

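3. Understanding Lucene scoring

The results in the Hits returned by a search are sorted by score by default. In terms of the factors listed below, the score of a document d for a query q is the sum, over each term t in q, of tf(t in d) × idf(t) × boost(t.field in d) × lengthNorm(t.field in d), with coord(q, d) and queryNorm(q) applied to the result.

The factors are: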

Factor	Description
tf(t in d)	Term frequency factor for the term (t) in the document (d).
idf(t)	Inverse document frequency of the term.
boost(t.field in d)	Field boost, as set during indexing.
lengthNorm(t.field in d)	Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index.
coord(q, d)	Coordination factor, based on the number of query terms the document contains.
queryNorm(q)	Normalization value for a query, given the sum of the squared weights of each of the query terms.

For more on scoring, see the Javadoc of the Similarity class.

The Explanation class exposes the details of each factor in a document's score, and its toString method prints them. An Explanation is obtained from IndexSearcher, as follows:

package lia.searching;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;

public class Explainer {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: Explainer <index dir> <query>");
      System.exit(1);
    }

    String indexDir = args[0];
    String queryExpression = args[1];

    FSDirectory directory =
        FSDirectory.getDirectory(indexDir, false);

    Query query = QueryParser.parse(queryExpression,
        "contents", new SimpleAnalyzer());

    System.out.println("Query: " + queryExpression);

    IndexSearcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(query);

    for (int i = 0; i < hits.length(); i++) {
      // Generate the Explanation of a single Document for this query.
      Explanation explanation = searcher.explain(query, hits.id(i));

      System.out.println("----------");
      Document doc = hits.doc(i);
      System.out.println(doc.get("title"));
      System.out.println(explanation.toString());   // print the score breakdown
    }

    searcher.close();
  }
}

The output:

Query: junit
----------
JUnit in Action
0.65311843 = fieldWeight(contents:junit in 2), product of:
  1.4142135 = tf(termFreq(contents:junit)=2)   // (1) junit appears twice in contents
  1.8472979 = idf(docFreq=2)
  0.25 = fieldNorm(field=contents, doc=2)
----------
Java Development with Ant
0.46182448 = fieldWeight(contents:junit in 1), product of:
  1.0 = tf(termFreq(contents:junit)=1)   // (2) junit appears once in contents
  1.8472979 = idf(docFreq=2)
  0.25 = fieldNorm(field=contents, doc=1)

(1) JUnit in Action has the term junit twice in its contents field. The contents field in our index is an aggregation of the title and subject fields to allow a single field for searching.

(2) Java Development with Ant has the term junit only once in its contents field.
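The numbers in this output can be checked by hand. The sketch below multiplies the three printed factors, assuming that tf is the square root of the term frequency (which is consistent with the printed 1.4142135 for two occurrences and 1.0 for one). The idf and fieldNorm values are copied from the output rather than computed. ExplainArithmetic is a hypothetical name, not a Lucene class.

```java
/** Hypothetical sketch: recomputes the fieldWeight lines of the Explainer output. */
final class ExplainArithmetic {

    /** Assumed tf: the square root of the raw term frequency. */
    static float tf(int termFreq) {
        return (float) Math.sqrt(termFreq);
    }

    /** fieldWeight = tf * idf * fieldNorm, the three lines under "product of:". */
    static float fieldWeight(int termFreq, float idf, float fieldNorm) {
        return tf(termFreq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        float idf = 1.8472979f;   // copied from the output (docFreq=2)
        float fieldNorm = 0.25f;  // copied from the output
        System.out.println("JUnit in Action:           " + fieldWeight(2, idf, fieldNorm));
        System.out.println("Java Development with Ant: " + fieldWeight(1, idf, fieldNorm));
    }
}
```

Both products agree with the printed fieldWeight values to within float rounding. Since idf and fieldNorm are identical for the two documents, the higher term frequency is the only reason JUnit in Action ranks first for this query.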

The toHtml method renders the same explanation as HTML. The Nutch project makes central use of Explanation (see the Nutch documentation).

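4. Creating queries programmatically

IndexSearcher's search method takes a Query instance. Query has several subclasses suited to different situations; let's go through them:

TermQuery
TermQuery is the simplest (it was used above): Term t = new Term("contents", "junit"); new TermQuery(t) constructs one.
TermQuery treats the query value as a single keyword that must match an indexed term exactly, which makes it the natural choice for fields indexed with Field.Keyword.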

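RangeQuery
As the name suggests, RangeQuery expresses a search over a range: RangeQuery query = new RangeQuery(begin, end, inclusive); where begin and end are Terms.
The boolean argument says whether the endpoints themselves are included. In query syntax this is written "[begin TO end]" (endpoints included) or "{begin TO end}" (endpoints excluded).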
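The effect of the boolean flag can be illustrated without Lucene. The sketch below assumes that terms are compared as plain strings in lexicographic order, which is why a fixed-width format such as YYYYMM sorts correctly. RangeSketch and inRange are hypothetical names, and the sample values are made up for illustration.

```java
/** Hypothetical sketch of inclusive vs. exclusive range matching on string terms. */
final class RangeSketch {

    /** Returns true if term lies between begin and end in lexicographic order. */
    static boolean inRange(String term, String begin, String end, boolean inclusive) {
        int lower = term.compareTo(begin);
        int upper = term.compareTo(end);
        return inclusive ? (lower >= 0 && upper <= 0)
                         : (lower > 0 && upper < 0);
    }

    public static void main(String[] args) {
        // [198805 TO 198810] includes both endpoints.
        System.out.println(inRange("198805", "198805", "198810", true));   // true
        // {198805 TO 198810} excludes them.
        System.out.println(inRange("198805", "198805", "198810", false));  // false
        System.out.println(inRange("198807", "198805", "198810", false));  // true
    }
}
```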

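PrefixQuery
As the name suggests, it matches terms that start with a given prefix. In query syntax it is written "something*".

BooleanQuery
A BooleanQuery is a logical combination of queries: you add any kind of Query to it and state how the clauses relate. Clauses are added with:

public void add(Query query, boolean required, boolean prohibited)

The two boolean arguments express the AND, OR and NOT relationships (setting both to true makes no logical sense). In query syntax these are written "AND OR NOT" or "+ -". A BooleanQuery can hold many clauses; if the number exceeds the value set with setMaxClauseCount(int) (1024 by default), a TooManyClauses exception is thrown.

Table 3: combinations of the two parameters

	required = false	required = true
prohibited = false	Clause is optional	Clause must match
prohibited = true	Clause must not match	Invalid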
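The table can be read as a matching rule. The sketch below models each clause as a precomputed yes/no result for one document and combines the results. It is a simplified model of the semantics, not Lucene's implementation: it assumes that a query without required clauses needs at least one optional clause to match, and rejecting the invalid combination in the constructor is a choice made for this sketch. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of how BooleanQuery clauses combine for one document. */
final class BooleanClauseSketch {

    /** One clause: the two flags passed to add(...) plus whether its query matched. */
    static final class Clause {
        final boolean required;
        final boolean prohibited;
        final boolean matched;

        Clause(boolean required, boolean prohibited, boolean matched) {
            if (required && prohibited) {
                throw new IllegalArgumentException(
                        "a clause cannot be both required and prohibited");
            }
            this.required = required;
            this.prohibited = prohibited;
            this.matched = matched;
        }
    }

    /** Returns true if the document satisfies the whole set of clauses. */
    static boolean matches(List<Clause> clauses) {
        boolean hasRequired = false;
        boolean anyOptionalMatched = false;
        for (Clause c : clauses) {
            if (c.prohibited) {
                if (c.matched) return false;          // a NOT clause matched
            } else if (c.required) {
                hasRequired = true;
                if (!c.matched) return false;         // an AND clause did not match
            } else if (c.matched) {
                anyOptionalMatched = true;            // an OR clause matched
            }
        }
        return hasRequired || anyOptionalMatched;
    }

    public static void main(String[] args) {
        // +junit +ant -mock, for a document containing junit and ant but not mock
        System.out.println(matches(Arrays.asList(
                new Clause(true, false, true),
                new Clause(true, false, true),
                new Clause(false, true, false))));    // true
    }
}
```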

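PhraseQuery
PhraseQuery can match phrases loosely: for example, "quick fox" can be made to match "quick brown fox", "quick brown high fox" and so on. For this purpose PhraseQuery provides setSlop(). During a search Lucene tries to shift the query terms into position, and slop is the maximum number of position moves allowed; if the document text can be brought into an exact match within that many moves, it counts as a match. The default slop is 0, so by default only exact phrases match. Setting slop enables sloppy matching: "quick fox" matches "quick brown fox" with a slop of 1 (fox is moved one position). Note that PhraseQuery does not require the terms to appear in query order: in the test below, the reversed phrase "fox quick" matches once slop is 3 or more (swapping two adjacent terms by itself already costs 2). For a phrase of more than two terms, slop is the maximum total number of moves allowed across all terms. An example makes this clearer: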

package lia.searching;

import junit.framework.TestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.RAMDirectory;

import java.io.IOException;

public class PhraseQueryTest extends TestCase {
  private IndexSearcher searcher;

  protected void setUp() throws IOException {
    // set up sample document
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
        new WhitespaceAnalyzer(), true);
    Document doc = new Document();
    doc.add(Field.Text("field",
        "the quick brown fox jumped over the lazy dog"));
    writer.addDocument(doc);
    writer.close();

    searcher = new IndexSearcher(directory);
  }

  /** Returns true if the phrase matches the sample document with the given slop. */
  private boolean matched(String[] phrase, int slop)
      throws IOException {
    PhraseQuery query = new PhraseQuery();
    query.setSlop(slop);

    for (int i = 0; i < phrase.length; i++) {
      query.add(new Term("field", phrase[i]));
    }

    Hits hits = searcher.search(query);
    return hits.length() > 0;
  }

  public void testSlopComparison() throws Exception {
    String[] phrase = new String[] {"quick", "fox"};

    assertFalse("exact phrase not found", matched(phrase, 0));

    assertTrue("close enough", matched(phrase, 1));
  }

  public void testReverse() throws Exception {
    String[] phrase = new String[] {"fox", "quick"};

    assertFalse("hop flop", matched(phrase, 2));
    assertTrue("hop hop slop", matched(phrase, 3));
  }

  public void testMultiple() throws Exception {     // phrases with more than two terms
    assertFalse("not close enough",
        matched(new String[] {"quick", "jumped", "lazy"}, 3));

    assertTrue("just enough",
        matched(new String[] {"quick", "jumped", "lazy"}, 4));

    assertFalse("almost but not quite",
        matched(new String[] {"lazy", "jumped", "quick"}, 7));

    assertTrue("bingo",
        matched(new String[] {"lazy", "jumped", "quick"}, 8));
  }
}
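The slop thresholds asserted in this test can be reproduced with a small position calculation. The sketch below is a simplified model, not Lucene's implementation: it assumes every phrase term occurs exactly once in the document, subtracts each term's offset within the phrase from its position in the document, and takes the spread (maximum minus minimum) of those shifted positions as the slop needed. SlopSketch and requiredSlop are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: slop needed for a phrase whose terms each occur once. */
final class SlopSketch {

    /** Returns the minimum slop for a match, or -1 if some term is absent. */
    static int requiredSlop(String text, String... phrase) {
        if (phrase.length == 0) {
            return 0;  // an empty phrase needs no moves
        }
        List<String> tokens = Arrays.asList(text.split(" "));
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int offset = 0; offset < phrase.length; offset++) {
            int pos = tokens.indexOf(phrase[offset]);  // first occurrence only
            if (pos < 0) {
                return -1;
            }
            // Where the phrase would have to start for this term to be in place.
            int shifted = pos - offset;
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return max - min;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumped over the lazy dog";
        System.out.println(requiredSlop(text, "quick", "fox"));            // 1
        System.out.println(requiredSlop(text, "fox", "quick"));            // 3
        System.out.println(requiredSlop(text, "quick", "jumped", "lazy")); // 4
        System.out.println(requiredSlop(text, "lazy", "jumped", "quick")); // 8
    }
}
```

These four values are exactly the points at which the assertions in testSlopComparison, testReverse and testMultiple flip from false to true.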

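WildcardQuery
Patterns use ? (zero or one character) and * (zero or more characters); for example, ?ild* matches wild, mild, wildcard and so on. Note that how closely a term resembles the pattern has no effect on scoring: for the pattern ?ild*, wildcard and mild count as equally good matches.

FuzzyQuery
FuzzyQuery matches terms that are similar to the given term; fuzzy and wuzzy, for example, are treated as similar. It is of some use for English tense variations and plural forms, and, unlike WildcardQuery, matches are scored differently depending on how close they are. In query syntax it is written "fuzzy~". It is most useful when you are not sure how a word is spelled: if you search Google for liceue and nothing is found, Google asks whether you meant Lucene. This query is of little use for Chinese text, however.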
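Lucene's fuzzy matching is based on Levenshtein (edit) distance. The sketch below computes that distance with the standard dynamic-programming recurrence. It illustrates the similarity measure only and does not reproduce how FuzzyQuery turns a distance into a score. EditDistanceSketch is a hypothetical name.

```java
/** Hypothetical sketch: Levenshtein distance between two strings. */
final class EditDistanceSketch {

    /** Minimum number of single-character insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;  // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;  // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("fuzzy", "wuzzy"));   // 1
        System.out.println(levenshtein("liceue", "lucene")); // 2
    }
}
```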

5. Parsing query expressions: QueryParser

For a product aimed at ordinary users, a search-expression syntax is fairly user-friendly. Let's see how QueryParser handles search expressions.

Note: Whenever special characters are used in a query expression, you need to provide an escaping mechanism so that the special characters can be used in a normal fashion. QueryParser uses a backslash (\) to escape special characters within terms. The escapable characters are as follows: \ + - ! ( ) : ^ ] { } ~ * ? (In short, special characters must be written with an escape character.)
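A helper along the following lines could apply that escaping to raw user input before it is handed to QueryParser. It is only a sketch: EscapeSketch and escape are hypothetical names, and the character set is exactly the one listed in the note above, so adjust it to whatever your Lucene version treats as special.

```java
/** Hypothetical helper: backslash-escapes the characters listed in the note. */
final class EscapeSketch {

    /** The escapable characters exactly as listed: \ + - ! ( ) : ^ ] { } ~ * ? */
    private static final String SPECIAL = "\\+-!():^]{}~*?";

    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');  // prefix each special character with a backslash
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("C++ (2nd ed)"));  // C\+\+ \(2nd ed\)
    }
}
```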

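QueryParser converts the query expression entered by the user into a Query. Query's toString method prints an equivalent form of what QueryParser produced, which lets you check whether QueryParser is doing what you intended. Note: QueryParser uses an Analyzer, and an Analyzer may drop stop words, so the toString output of a parsed Query is not necessarily identical to the original string.

Boolean operators:

Use OR, AND and NOT (or + and -); this is straightforward.

Grouping
For example "(a AND b) OR c": parentheses group clauses, which is also straightforward.

Field selection
QueryParser applies query terms to a default field, which is specified in code when the expression is parsed. To search a different field from within the expression, use the syntax fieldname:a; to apply a field to a group of terms, use fieldname:(a b c).

Range search

Use [begin TO end] (endpoints included) and {begin TO end} (endpoints excluded).

Note: Nondate range queries use the beginning and ending terms as the user entered them, without modification. In other words, the beginning and ending terms are not analyzed. Start and end terms must not contain whitespace, or parsing fails. In our example index, the field pubmonth isn’t a date field; it’s text of the format YYYYMM.

When handling dates, QueryParser's setLocale method sets the locale used to parse them, which takes care of internationalization.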

Phrase query:

A string enclosed in double quotes creates a PhraseQuery. The text between the quotes is analyzed before the Query is built, so stop words may be dropped, as in:

public void testPhraseQuery() throws Exception {
  // "this" and "is" are stop words for StandardAnalyzer
  Query q = QueryParser.parse("\"This is Some Phrase*\"",
      "field", new StandardAnalyzer());
  assertEquals("analyzed",
      "\"some phrase\"", q.toString("field"));   // "this is" has been dropped

  // analyzer is a field of the enclosing test class, not shown in this excerpt
  q = QueryParser.parse("\"term\"", "field", analyzer);
  assertTrue("reduced to TermQuery", q instanceof TermQuery);
}

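Wildcard search
Note that by default QueryParser does not allow * at the start of a term, mainly to keep a user from accidentally entering a leading * and causing a serious performance problem.

Fuzzy query:

A trailing ~ turns a term into a fuzzy query.

Both wildcard and fuzzy searches have performance implications, which are discussed later.

Boosting queries

A ^ followed by a floating-point number sets the boost of a term. For example, junit^2.0 testing sets the boost of the junit TermQuery to 2.0, while the testing TermQuery keeps the default boost of 1.0. You can try whether Google search has this feature. :)

QueryParser is very convenient, but it does not fit every situation. Here is the authors' view: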

To QueryParse or not to QueryParse?

QueryParser is a quick and effortless way to give users powerful query construction, but it isn’t right for all scenarios. QueryParser can’t create every type of query that can be constructed using the API. In chapter 5, we detail a handful of API-only queries that have no QueryParser expression capability. You must keep in mind all the possibilities available when exposing free-form query parsing to an end user; some queries have the potential for performance bottlenecks, and the syntax used by the built-in QueryParser may not be suitable for your needs. You can exert some limited control by subclassing QueryParser (see section 6.3.1).

Should you require different expression syntax or capabilities beyond what QueryParser offers, technologies such as ANTLR and JavaCC are great options. We don’t discuss the creation of a custom query parser; however, the source code for Lucene’s QueryParser is freely available for you to borrow from.

You can often obtain a happy medium by combining a QueryParser-parsed query with API-created queries as clauses in a BooleanQuery. This approach is demonstrated in section 5.5.4. For example, if users need to constrain searches to a particular category or narrow them to a date range, you can have the user interface separate those selections into a category chooser or separate date-range fields.

That wraps up chapter 3; you can now add basic search to an application. Hooray!

A summary :)

Lucene rapidly provides highly relevant search results to queries. Most applications need only a few Lucene classes and methods to enable searching. The most fundamental things for you to take from this chapter are an understanding of the basic query types (of which TermQuery, RangeQuery, and BooleanQuery are the primary ones) and how to access search results.

Although it can be a bit daunting, Lucene’s scoring formula (coupled with the index format discussed in appendix B and the efficient algorithms) provides the magic of returning the most relevant documents first. Lucene’s QueryParser parses human-readable query expressions, giving rich full-text search power to end users. QueryParser immediately satisfies most application requirements; however, it doesn’t come without caveats, so be sure you understand the rough edges. Much of the confusion regarding QueryParser stems from unexpected analysis interactions; chapter 4 goes into great detail about analysis, including
