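Lucene Full-Text Search in Practice

Lucene is an Apache subproject: a search engine library for full-text retrieval. It offers a simple, practical API with which you can write your own full-text search programs over files (plain text, XML, HTML, and so on), directories, and databases.

Features: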
* Very fast indexing, minimal RAM required
* Index compression to 30% of original text
* Indexes text and HTML, document classes available for XML, PDF and RTF
* Search supports phrase and Boolean queries, plus, minus and quote marks, and parentheses
* Allows single and multiple character wildcards anywhere in the search words, fuzzy search, proximity
* Will search for punctuation such as + or
* Field searches for title, author, etc., and date-range searching
* Supports most European languages
* Option to store and display full text of indexed documents
* Search results in relevance order
* APIs for file format conversion, languages and user interfaces

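Practice tasks:
1) Write a Java program, MyIndexer.java, that reads the contents of a MySQL table through JDBC (using data from a forum as the test set) and builds an index with org.apache.lucene.index.IndexWriter.
2) Write a Java program, MySearcher.java, that queries the index through org.apache.lucene.search.IndexSearcher and related classes.
3) Support Chinese queries and highlight the search keywords in the results.
4) Call MySearcher.java from PHP through PHP / Java Integration.
5) Provide full-text search over the PHP manual (Simplified Chinese).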
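
Task 3 depends on how Chinese text is split into index terms. The code later in this post uses Lucene's CJKAnalyzer, which indexes runs of CJK characters as overlapping two-character tokens. The sketch below illustrates only that idea in plain Java. It is not Lucene's implementation: the class BigramSketch, its method names, the single Unicode block it checks, and the decision to skip non-CJK characters are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration of bigram tokenization; not Lucene code.
final class BigramSketch {
    /** True for characters in the main CJK ideograph block (a simplification). */
    static boolean isCjk(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    /** Splits each run of CJK characters into overlapping two-character tokens. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int runStart = -1; // index where the current CJK run began, or -1 if none
        for (int i = 0; i <= text.length(); i++) {
            boolean cjk = i < text.length() && isCjk(text.charAt(i));
            if (cjk && runStart < 0) {
                runStart = i;
            } else if (!cjk && runStart >= 0) {
                addRun(text.substring(runStart, i), tokens);
                runStart = -1;
            }
        }
        return tokens;
    }

    private static void addRun(String run, List<String> tokens) {
        if (run.length() == 1) {
            tokens.add(run); // a lone character becomes its own token (this sketch's choice)
            return;
        }
        for (int i = 0; i + 2 <= run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
    }

    public static void main(String[] args) {
        // A four-character Chinese phrase, written with Unicode escapes
        System.out.println(bigrams("\u5168\u6587\u68c0\u7d22"));
    }
}
```

For the four-character input 全文检索, the sketch yields the three tokens 全文, 文检, and 检索, which is why a two-character query such as 检索 can match inside a longer phrase.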

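The Java programs are mostly written, and Chinese is supported. The next step is to run them on the web. JSP came to mind first, so I installed Apache Tomcat/4.1.24, which listens on port 8080 by default. That raises a problem: Apache httpd already uses port 80, and my machine can be reached from outside only on port 80. If I moved Tomcat to port 80, httpd could no longer serve outside requests, and the PHP programs running on it could not be served on port 80 either.

I came up with two approaches:
1. Call Java directly from PHP. This requires recompiling PHP with --with-java.
2. Bridge with mod_jk, integrating the servlet engine into httpd. This requires compiling jakarta-tomcat-connectors-jk-1.2.5-src to produce mod_jk.so for httpd, then configuring Tomcat and httpd according to the Howto documentation.

Trying the first approach: calling Java directly from PHP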

Environment
* PHP 4.3.6 prefix=/usr
* Apache 1.3.27 prefix=/usr/local/apache
* j2sdk1.4.1_01 prefix=/usr/local/jdk

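Configuration steps
1) Install the JDK. I won't go into detail; a Google search turns up plenty of articles on this.

2) Recompile PHP. My PHP version is 4.3.6:

cd php-4.3.6
./configure --with-java=/usr/local/jdk
make
make install

When this finishes, there is a php_java.jar under PHP's lib directory (/usr/lib/php on my machine) and a java.so file in the extension directory (/usr/lib/php/20020429 on my machine). One thing to watch at this point: some PHP versions produce libphp_java.so instead, and extension loading recognizes only libphp_java.so. Loading java.so directly may fail with the following error:

PHP Fatal error: Unable to load Java Library /usr/local/jdk/jre/lib/i386/libjava.so, error: libjvm.so: cannot open shared object file: No such file or directory in /home/nio/public_html/java.php on line 2

So if the build produced java.so, create a symbolic link in the extension directory:

ln -s java.so libphp_java.so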

3) Edit the Apache service startup script (/etc/init.d/httpd on my machine) and add:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/jdk/jre/lib/i386/server:/usr/local/jdk/jre/lib/i386

As you can see, my JDK is installed under /usr/local/jdk. If yours is elsewhere, adjust the paths accordingly (here and below).

4) Edit the PHP configuration file php.ini and change the [Java] section:

[Java]
java.class.path = /usr/lib/php/php_java.jar
java.home = /usr/local/jdk
;java.library =
;java.library.path =
extension_dir=/usr/lib/php/20020429/
extension=java.so

I commented out both java.library and java.library.path; PHP then assumes java.library=/usr/local/jdk/jre/lib/i386/libjava.so.

5) Restart the Apache httpd service:

service httpd restart

Test
Source of the test script java.php:

<?php
$system = new Java('java.lang.System');
print 'Java version=' . $system->getProperty('java.version').'<br />';
print 'Java vendor=' . $system->getProperty('java.vendor').'<br />';
print 'OS=' . $system->getProperty('os.name') . ' ' .
      $system->getProperty('os.version') . ' on ' .
      $system->getProperty('os.arch') . '<br />';
?>
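
To cross-check what the PHP script reports, the same properties can be read from plain Java on the same machine. This is only a sketch for comparison; the class name JvmInfo and its describe() method are made up here, and the output uses line breaks instead of the HTML tags in java.php.

```java
// Hypothetical helper that prints the properties queried by java.php.
final class JvmInfo {
    /** Builds the same three lines the PHP test prints, separated by newlines. */
    static String describe() {
        return "Java version=" + System.getProperty("java.version") + "\n"
             + "Java vendor=" + System.getProperty("java.vendor") + "\n"
             + "OS=" + System.getProperty("os.name") + " "
             + System.getProperty("os.version") + " on "
             + System.getProperty("os.arch");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the values printed by the JDK under /usr/local/jdk differ from what java.php shows, PHP is probably loading a different JVM than expected.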

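Summary
Installation and configuration are fairly simple, but running Java from PHP felt slow, so I decided to move on to the second approach.

I finally have some free time today, so here is the second approach: bridging with mod_jk to integrate the servlet engine into httpd.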

Environment
* PHP 4.3.6 prefix=/usr
* Apache 1.3.27 prefix=/usr/local/apache
* j2sdk1.4.1_01 prefix=/usr/local/jdk
* jakarta-tomcat-4.1.24 prefix=/usr/local/tomcat
* Also needed: download jakarta-tomcat-connectors-jk-1.2.5-src.tar.gz

Configuration steps
1) Install the JDK and Tomcat. I won't go into the installation steps here.

2) Compile jakarta-tomcat-connectors-jk-1.2.5-src to produce mod_jk.so, and copy it to Apache's modules directory:

tar xzf jakarta-tomcat-connectors-jk-1.2.5-src.tar.gz
cd jakarta-tomcat-connectors-jk-1.2.5-src/jk/native
./configure --with-apxs=/usr/local/apache/bin/apxs
make
cp apache-1.3/mod_jk.so /usr/local/apache/libexec

3) Edit the Apache configuration file /usr/local/apache/conf/httpd.conf and add:

LoadModule jk_module libexec/mod_jk.so
AddModule mod_jk.c

This LoadModule line is best placed after the other LoadModule lines.
Also add the following at the end of the configuration file:

# Path to the workers.properties file, which is explained below
JkWorkersFile /usr/local/apache/conf/workers.properties
# Where the jk log file is written
JkLogFile /usr/local/apache/log/mod_jk.log
# jk log level [debug/error/info]
JkLogLevel info
# Timestamp format for log entries
JkLogStampFormat "[%a %b %d %H:%M:%S %Y] "
# JkOptions settings
JkOptions +ForwardKeySize +ForwardURICompat -ForwardDirectories
# Request log format
JkRequestLogFormat "%w %V %T"
# Map /examples/* to worker1, which is defined in workers.properties
JkMount /examples/* worker1

4) Create a workers.properties file in /usr/local/apache/conf/ with the following contents:

# Define worker1, which uses ajp13
worker.list=worker1
# Properties of worker1 (ajp13)
worker.worker1.type=ajp13
worker.worker1.host=localhost
worker.worker1.port=8009
worker.worker1.lbfactor=50
worker.worker1.cachesize=10
worker.worker1.cache_timeout=600
worker.worker1.socket_keepalive=1
worker.worker1.socket_timeout=300
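
A typo in a worker name is easy to make, because the name appears both in worker.list and in every worker.<name>.* key. The sketch below is a small, hypothetical sanity check written with java.util.Properties. The class WorkersCheck and the two checks it performs (every listed worker has at least one property, and a port, when given, is numeric) are this example's own conventions, not rules taken from the mod_jk documentation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical checker for the text of a workers.properties file.
final class WorkersCheck {
    /** Returns the problems found; an empty list means both checks passed. */
    static List<String> check(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> problems = new ArrayList<>();
        String list = props.getProperty("worker.list", "").trim();
        if (list.isEmpty()) {
            problems.add("worker.list is missing or empty");
            return problems;
        }
        for (String name : list.split("\\s*,\\s*")) {
            String prefix = "worker." + name + ".";
            boolean defined = false;
            for (String key : props.stringPropertyNames()) {
                if (key.startsWith(prefix)) {
                    defined = true;
                    break;
                }
            }
            if (!defined) {
                problems.add(name + ": listed but has no " + prefix + "* properties");
                continue;
            }
            String port = props.getProperty(prefix + "port");
            if (port != null && !port.trim().matches("\\d+")) {
                problems.add(name + ": port is not a number: " + port.trim());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String sample = "worker.list=worker1\n"
                      + "worker.worker1.type=ajp13\n"
                      + "worker.worker1.port=8009\n";
        System.out.println(check(sample)); // an empty list is printed when nothing is wrong
    }
}
```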

5) Start Tomcat, restart the Apache HTTPD Server, and visit http://localhost/examples/index.jsp. The result should be the same as http://localhost:8080/examples/index.jsp.

Tip: if you don't want others to reach your Tomcat through port 8080, comment out the following block in /usr/local/tomcat/conf/server.xml:

<!--
<Connector className="org.apache.coyote.tomcat4.CoyoteConnector"
           port="8080" minProcessors="5" maxProcessors="75"
           enableLookups="false" redirectPort="8443"
           acceptCount="100" debug="0" connectionTimeout="20000"
           useURIValidationHack="false" disableUploadTimeout="true" />
-->

Then restart Tomcat.

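Summary
This approach takes a bit more work to install and configure, but it runs much more efficiently than the first one, so I decided to use it for my Lucene full-text search tasks.

Some time has passed since my initial look at Lucene. I don't feel I went very deep, and for lack of time I never picked it up again. At readers' request, I am posting some of the code I wrote along the way in the hope that it is useful. The programs are not optimized; they simply implement the functionality and are for reference only.

In practice, I built an index from the HTML files of the Chinese PHP manual and then searched it through a JSP page.
Java code that builds the index: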

/**
 * PHPDocIndexer.java
 * Builds an index from the HTML pages of PHPDoc.
 */
import java.io.File;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Date;
import java.text.DateFormat;
import java.lang.*;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateField;

class PHPDocIndexer
{
    public static void main(String[] args) throws ClassNotFoundException, IOException
    {
        try {
            Date start = new Date();
            IndexWriter writer = new IndexWriter("/home/nio/indexes-phpdoc", new CJKAnalyzer(), true); // index directory; must exist
            indexDocs(writer, new File("/home/nio/phpdoc-zh")); // directory holding the HTML files
            System.out.println("Optimizing ....");
            writer.optimize();
            writer.close();
            Date end = new Date();
            System.out.print("Total time: ");
            System.out.println(end.getTime() - start.getTime());
        } catch (Exception e) {
            System.out.println("Class " + e.getClass() + " throws error!\n errmsg: " + e.getMessage());
        } //end try
    } //end main

    // Walks the directory tree and adds every .html file to the index
    public static void indexDocs(IndexWriter writer, File file) throws Exception
    {
        if (file.isDirectory()) {
            String[] files = file.list();
            for (int i = 0; i < files.length; i++) {
                indexDocs(writer, new File(file, files[i]));
            } //end for
        } else if (file.getPath().endsWith(".html")) { // index HTML files only
            System.out.print("Add file:" + file + " ....");
            // Add html file ....
            Document doc = new Document();
            doc.add(Field.UnIndexed("file", file.getName())); // file name (stored, not indexed)
            doc.add(Field.UnIndexed("modified", DateFormat.getDateTimeInstance().format(new Date(file.lastModified())))); // last-modified time (stored, not indexed)
            String title = "";
            String content = "";
            String status = "start";
            FileReader fReader = new FileReader(file);
            BufferedReader bReader = new BufferedReader(fReader);
            String line = bReader.readLine();
            while (line != null) {
                content += line;
                // Pick out the HTML <title>: the line "><TITLE" marks the tag,
                // and the title text is expected on the following line
                if ("start".equals(status) && line.equalsIgnoreCase("><TITLE")) {
                    status = "match";
                } else if ("match".equals(status)) {
                    // drop the first character and the last 7 (expected to be ">" and "</TITLE")
                    title = line.substring(1, line.length() - 7);
                    doc.add(Field.Text("title", title)); // title (indexed and stored)
                    status = "end";
                } //end if
                line = bReader.readLine();
            } //end while
            bReader.close();
            fReader.close();
            doc.add(Field.Text("content", content.replaceAll("<[^<>]+>", ""))); // content with HTML tags stripped
            writer.addDocument(doc);
            System.out.println(" [OK]");
        } //end if
    }
} //end class
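
The indexer's title handling depends on the exact line layout of the manual's HTML files. As an alternative for files laid out differently, the title can be extracted and the tags stripped with regular expressions over the whole document. This is a rough stand-alone sketch, not part of the original program: the class HtmlText and its methods are hypothetical names, and only the tag-stripping pattern is taken from the indexer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers; only the TAG pattern comes from the indexer in this post.
final class HtmlText {
    // DOTALL lets the title span line breaks; CASE_INSENSITIVE accepts <TITLE> and <title>.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\s*>(.*?)</title\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^<>]+>");

    /** Returns the trimmed text of the first title element, or "" if there is none. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Removes everything that looks like a tag, including tags broken across lines. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<HTML\n><HEAD\n><TITLE\n>PHP Manual</TITLE\n></HEAD\n><BODY\n><P\n>Hello</P\n></BODY\n></HTML\n>";
        System.out.println(extractTitle(html));
        System.out.println(stripTags(html));
    }
}
```

A regular expression is not an HTML parser, so this works only for simple, well-formed pages. Text inside script or style elements, for example, would still end up in the stripped output.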

Once the index is built, a search page is needed. Here is the code for the search page (search.jsp):

<%@ page language="java" import="javax.servlet.*, javax.servlet.http.*, java.io.*, java.util.Date, java.util.ArrayList, java.util.regex.*, org.apache.lucene.analysis.*, org.apache.lucene.document.*, org.apache.lucene.index.*, org.apache.lucene.search.*, org.apache.lucene.queryParser.*, org.apache.lucene.analysis.Token, org.apache.lucene.analysis.TokenStream, org.apache.lucene.analysis.cjk.CJKAnalyzer, org.apache.lucene.analysis.cjk.CJKTokenizer, com.chedong.weblucene.search.WebLuceneHighlighter" %>

<%@ page contentType="text/html;charset=GB2312" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>PHPDoc - Full-Text Search of the PHP Manual (Simplified Chinese)</title>
<style>
body {background-color: white; margin: 4px}
body, input, div {font-family: Tahoma; font-size: 9pt}
body, div {line-height: 18px}
u {color: red}
b {color: navy}
form {padding: 0px; margin: 0px}
.txt {border: 1px solid black}
.f {padding: 4px; margin-bottom: 16px; background-color: #E5ECF9; border-top: 1px solid #3366CC; border-bottom: 1px solid #3366CC; text-align: center;}
.d, .o {padding-left: 16px}
.d {color: gray}
.o {color: green}
.o a {color: #7777CC}
</style>
<script type="text/javascript">
function gotoPage(i)
{
    document.frm.page.value = i;
    document.frm.submit();
} //end function
</script>
</head>
<body>

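<%
String keyVal = null;
String pageVal = null;
int offset = 0;
int curPage = 0;
int pages;
final int ROWS = 50;

// Read the GET parameters
try {
    // Recover the raw bytes of the keyword parameter, then decode them again
    // with the platform default charset
    byte[] keyValByte = request.getParameter("key").getBytes("ISO8859_1"); // search keywords
    keyVal = new String(keyValByte);
    pageVal = request.getParameter("page"); // page number
} catch (Exception e) {
    // do nothing
}
if (keyVal == null)
    keyVal = "";
%>
<div class="f">
<form name="frm" method="get">
<input type="text" name="key" class="txt" value="<%=keyVal%>" />
<input type="hidden" name="page" value="1" />
<input type="submit" value="Search" />
<br />
<font color="green">Tip: use several keywords (separated by spaces) to make the search more precise.</font>
</form>
<script type="text/javascript">
document.frm.key.focus();
</script>
</div>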

<%
if (keyVal != null && keyVal.length() > 0) {
    try {
        curPage = Integer.parseInt(pageVal); // convert the current page to an integer
    } catch (Exception e) {
        // do nothing
    } //end try

    try {
        Date startTime = new Date();
        // Drop standalone "or"/"and", then join the remaining keywords with AND
        keyVal = keyVal.toLowerCase().replaceAll("\\b(or|and)\\b", "").trim().replaceAll("\\s+", " AND ");
        Searcher searcher = new IndexSearcher("/home/nio/indexes-phpdoc"); // index directory
        Analyzer analyzer = new CJKAnalyzer();
        String[] fields = {"title", "content"};
        Query query = MultiFieldQueryParser.parse(keyVal, fields, analyzer);
        Hits hits = searcher.search(query);

        // Tokenize the query to collect the terms passed to the highlighter
        StringReader in = new StringReader(keyVal);
        TokenStream tokenStream = analyzer.tokenStream("", in);
        ArrayList al = new ArrayList();
        for (Token token = tokenStream.next(); token != null; token = tokenStream.next()) {
            al.add(token.termText());
        } //end for

        // total number of pages
        pages = (hits.length() % ROWS != 0) ? (hits.length() / ROWS) + 1 : (hits.length() / ROWS);

        // current page
        if (curPage < 1)
            curPage = 1;
        else if (curPage > pages)
            curPage = pages;

        // start and end offsets
        offset = (curPage - 1) * ROWS;
        int end = Math.min(hits.length(), offset + ROWS);

        // print the results in a loop
        WebLuceneHighlighter hl = new WebLuceneHighlighter(al);
        for (int i = offset; i < end; i++) {
            Document doc = hits.doc(i);
%>
<%-- the title and highlighted excerpt markup that used hl is missing from this listing --%>
<div class="o">
/~nio/phpdoc-zh/<%=doc.get("file")%>
<%=doc.get("modified")%>
</div>
<br />
<%
        } //end for
        searcher.close();
        Date endTime = new Date();
%>
Search took <b><%=((endTime.getTime() - startTime.getTime()) / 1000.0)%></b> seconds; about <b><%=hits.length()%></b> matching records, <b><%=pages%></b> pages in total
<%
        if (curPage > 1 && pages > 1) {
%>
|<a href="javascript:gotoPage(<%=(curPage-1)%>);" target="_self">Previous</a>
<%
        } //end if
        if (curPage < pages && pages > 1) {
%>
|<a href="javascript:gotoPage(<%=(curPage+1)%>)" target="_self">Next</a>
<%
        } //end if
    } catch (Exception e) {
        // error handling is not shown in this listing
    } //end try
} //end if
%>

</body>

</html>
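
Three small pieces of logic in the search page can be tried outside a servlet container: re-decoding a parameter that was decoded with the wrong charset, turning the user's keywords into an AND query, and computing the page boundaries. The sketch below restates them in plain Java under some assumptions: the class SearchSketch and its method names are hypothetical, UTF-8 stands in for the page's charset in the example, and clampPage never returns less than 1, which differs slightly from the JSP when there are no hits.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical restatement of logic from search.jsp; no Lucene or servlet classes involved.
final class SearchSketch {
    /** Undoes a wrong ISO-8859-1 decode: recover the raw bytes, then decode them properly. */
    static String redecode(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    /** Drops standalone "and"/"or", then joins the remaining words with AND. */
    static String normalizeQuery(String raw) {
        String cleaned = raw.toLowerCase().replaceAll("\\b(or|and)\\b", " ").trim();
        return cleaned.replaceAll("\\s+", " AND ");
    }

    /** Number of pages needed for the given hit count (integer ceiling division). */
    static int pageCount(int hits, int rows) {
        return (hits + rows - 1) / rows;
    }

    /** Forces the requested page into the range 1..pages, and to 1 when pages is 0. */
    static int clampPage(int page, int pages) {
        return Math.max(1, Math.min(page, pages));
    }

    public static void main(String[] args) {
        int hits = 101, rows = 50;
        int pages = pageCount(hits, rows);
        int page = clampPage(9, pages);
        int offset = (page - 1) * rows;           // first result shown on this page
        int end = Math.min(hits, offset + rows);  // one past the last result shown
        System.out.println(normalizeQuery("PHP and  MySQL or array"));
        System.out.println("pages=" + pages + " page=" + page + " offset=" + offset + " end=" + end);
    }
}
```

The word boundaries in normalizeQuery matter: without them, removing "or" would also damage words such as "sort".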

Posted on 2009-03-03 11:31 by 草原上的骆驼. Category: Search Services
