BlogJava - ︻┳═一Java - Post category: Lucene

A comparison of the main Chinese analyzers currently available for Lucene [repost]

Eric.Zhou — Sun, 09 Aug 2009 02:15:00 GMT

Reposted from: http://www.javaeye.com/news/9637

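1. Overview:

paoding: "Paoding Jieniu" (Paoding Analysis), a Chinese analyzer for Lucene
imdict: the intelligent Chinese word segmenter used by the imdict smart dictionary
mmseg4j: a Chinese segmenter that implements Chih-Hao Tsai's MMSeg algorithm
ik: uses its own "forward iterative finest-granularity segmentation algorithm" with a multi-sub-processor analysis mode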

2. Developers and development activity:

paoding: qieqie.wang; last commit on Google Code: 2008-06-12, svn revision 132
imdict: XiaoPingGao; accepted into Lucene contrib; last commit to contrib/analyzers/smartcn/ in Lucene trunk: 2009-07-24
mmseg4j: chenlb2008; last commit on Google Code: 2009-08-03 (yesterday), revision 57, log message: "mmseg4j-1.7 创建分支" (created the mmseg4j-1.7 branch)
ik: linliangyi2005; last commit on Google Code: 2009-07-31, revision 41

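3. User-defined dictionaries:

paoding: supports any number of user-defined dictionaries in plain-text format, one word per line. A background thread detects dictionary updates, automatically compiles updated dictionaries into a binary form, and loads them
imdict: does not support user-defined dictionaries for now, although the original ICTCLAS does. Supports user-defined stop words
mmseg4j: ships with the sogou dictionary; supports user-defined dictionaries named wordsxxx.dic in UTF-8 text format, one word per line. No automatic update detection. The dictionary location is set with -Dmmseg.dic.path
ik: supports loading user dictionaries through the API and naming dictionary files in the configuration; files are UTF-8 without a BOM, with entries separated by \r\n. No automatic update detection.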
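The dictionary formats described above (plain text, one word per line, UTF-8) are simple enough to parse by hand. The following is a minimal sketch in plain Java, not code from any of the four projects; the class name UserDictionaryLoader and its parse method are invented for illustration, and skipping a BOM is my own assumption about how to be lenient with input.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration; not taken from paoding, mmseg4j or ik.
final class UserDictionaryLoader {

    /** Parses dictionary text with one word per line into an ordered, duplicate-free set. */
    static Set<String> parse(String content) {
        // Skip a leading byte order mark in case the file was saved with one.
        if (content.startsWith("\uFEFF")) {
            content = content.substring(1);
        }
        Set<String> words = new LinkedHashSet<>();
        // Accept both Windows and Unix line endings.
        for (String line : content.split("\r?\n")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Blank lines and duplicates are dropped.
        System.out.println(parse("庖丁解牛\r\n\r\n中文分词\r\n庖丁解牛\r\n"));
    }
}
```

To read a real file, the text could be loaded with java.nio.file.Files and passed to parse as a String.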

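4. Speed (taken from each project's own documentation, not measured by me)

paoding: on a PIII PC with 1 GB of memory, accurately segments 1 million Chinese characters per second
imdict: 483.64 (bytes/s as quoted, though KB/s looks more plausible next to the character rate), 259517 (Chinese characters/s)
mmseg4j: complex mode about 1200 KB/s, simple mode about 1900 KB/s
ik: a claimed throughput of 500,000 characters per second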

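5. Algorithm and code complexity

paoding: the svn src directory totals 1.3 MB, with 6 properties files and 48 Java files, 6895 lines. It uses different Knife implementations to cut different kinds of streams; not very complicated.
imdict: 6.7 MB dictionary (required), 152 KB src directory, 20 Java files, 2399 lines. Uses the ICTCLAS HHMM (hierarchical hidden Markov model), which "uses training on a large corpus to collect the word frequencies and transition probabilities of Chinese vocabulary, and then uses those statistics to compute the most likely segmentation of a whole Chinese sentence"
mmseg4j: the svn src directory totals 132 KB, 23 Java files, 2089 lines. The MMSeg algorithm; somewhat complicated.
ik: the svn src directory totals 6.6 MB (dictionary files included), 22 Java files, 4217 lines. Multi-sub-processor analysis, similar to paoding; I have not yet figured out its disambiguation algorithm.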
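To give a feel for the dictionary-based approach, here is a minimal sketch of forward maximum matching, which as far as I understand is the idea behind MMSeg's simple mode. It is not mmseg4j's code: the class name, the toy dictionary and the maxWordLength parameter are assumptions for illustration, and the chunk-based disambiguation rules of MMSeg's complex mode are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the mmseg4j implementation.
final class ForwardMaxMatch {

    /** Greedily takes the longest dictionary word at each position, falling back to one character. */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        // Greedy matching picks "研究生" first and leaves "命" stranded.
        System.out.println(segment("研究生命起源", dictionary, 3));
    }
}
```

The example input is chosen to show the weakness of purely greedy matching: taking the longest word first can leave a stray character behind, which is the kind of ambiguity the more elaborate algorithms try to resolve.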

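6. Documentation

paoding: almost none. The code has some comments, but because the implementation is fairly complex, reading it is still somewhat difficult.
imdict: almost none. ICTCLAS has no detailed documentation either, and the HHMM is mathematically heavy and hard to follow.
mmseg4j: the description of the MMSeg algorithm is in English, but the principle is fairly simple and the implementation is fairly clear.
ik: has a PDF user manual with usage examples and configuration notes.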

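7. Other

paoding: built around a metaphor, and the design is fairly sound. search 1.0 used this one. Its main strength is native support for dictionary update detection. Its main weakness is that the author no longer updates or even maintains it.
imdict: it is in Lucene trunk; the original ictclas has performed well in various evaluations and has a solid theoretical foundation, so it is not a one-person knockoff. Its drawback is that it does not support user dictionaries for now.
mmseg4j: implements max-word segmentation on top of complex mode, but this is still immature and needs a lot of improvement.
ik: provides IKQueryParser, a query parser optimized for Lucene full-text search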

8. Conclusion

Personally, I would pick either mmseg4j or paoding. For a comparison of their segmentation quality, see:

http://blog.chenlb.com/2009/04/mmseg4j-max-word-segment-compare-with-paoding-in-effect.html

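Alternatively, add a wrapper of your own: implement paoding's dictionary update detection as a separate module, and you can then switch seamlessly between all the dictionary-based segmentation algorithms.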
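One way such a standalone update-detection module might look is sketched below. It compares the file's modification time on each call and reloads when it has changed. This is an assumption about the design, not paoding's actual mechanism; DictionaryWatcher and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical module; paoding's own detection code is not reproduced here.
final class DictionaryWatcher {
    private final Path file;
    private FileTime lastLoaded;              // modification time seen at the last load
    private List<String> words = Collections.emptyList();

    DictionaryWatcher(Path file) {
        this.file = file;
    }

    /** Reloads the word list if the file's modification time changed; returns true on reload. */
    synchronized boolean reloadIfChanged() {
        try {
            FileTime modified = Files.getLastModifiedTime(file);
            if (modified.equals(lastLoaded)) {
                return false;
            }
            words = Files.readAllLines(file, StandardCharsets.UTF_8);
            lastLoaded = modified;
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized List<String> words() {
        return words;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("words", ".dic");
        Files.write(file, Arrays.asList("庖丁", "解牛"), StandardCharsets.UTF_8);
        DictionaryWatcher watcher = new DictionaryWatcher(file);
        System.out.println(watcher.reloadIfChanged() + " " + watcher.words());
        Files.delete(file);
    }
}
```

A background thread or scheduled task could call reloadIfChanged() periodically and hand the new word list to whichever segmenter is in use.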

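P.S. Using a different analyzer for each field is worth considering. A tag field, for example, should use the simplest possible analyzer; splitting on whitespace is enough.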
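The per-field idea can be sketched without Lucene as a lookup from field name to tokenizer. Lucene itself ships a PerFieldAnalyzerWrapper class for this purpose; the code below is only a plain-Java illustration of the dispatch, with hypothetical names, and the character splitter stands in for a real Chinese segmenter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustration of per-field dispatch; not a Lucene Analyzer.
final class PerFieldTokenizers {
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    PerFieldTokenizers(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    void register(String field, Function<String, List<String>> tokenizer) {
        byField.put(field, tokenizer);
    }

    /** Uses the tokenizer registered for the field, or the fallback if there is none. */
    List<String> tokenize(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    /** The "simplest possible" tokenizer: split on runs of whitespace. */
    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.<String>emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Stand-in for a real segmenter: one token per non-whitespace character. */
    static List<String> splitIntoCharacters(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        PerFieldTokenizers tokenizers = new PerFieldTokenizers(PerFieldTokenizers::splitIntoCharacters);
        tokenizers.register("tag", PerFieldTokenizers::splitOnWhitespace);
        System.out.println(tokenizers.tokenize("tag", "lucene 中文分词"));
        System.out.println(tokenizers.tokenize("content", "中文分词"));
    }
}
```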

Eric.Zhou 2009-08-09 10:15

A first try at full-text search with Lucene

Eric.Zhou — Mon, 29 Jan 2007 01:57:00 GMT

HTML parser
package com.rain.util;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

import org.apache.lucene.demo.html.HTMLParser;

public class HTMLDocParser {

private String htmlPath;
private HTMLParser htmlParser;

public HTMLDocParser(String htmlPath){
  this.htmlPath=htmlPath;
  initHtmlParser();
}
public void initHtmlParser(){
  InputStream inputStream=null;
  try{
   inputStream=new FileInputStream(htmlPath);
  }catch(FileNotFoundException e){
   e.printStackTrace();
  }
  if(null!=inputStream){
   try{
    htmlParser=new HTMLParser(new InputStreamReader(inputStream,"utf-8"));
   }catch(UnsupportedEncodingException e){
    e.printStackTrace();
   }
  }
}
public String getTitle(){
  if(null!=htmlParser){
   try{
    return htmlParser.getTitle();
   }catch(IOException e){
    e.printStackTrace();
   }catch(InterruptedException e){
    e.printStackTrace();
   }
  }
  return "";
}
public Reader getContent(){
  if(null!=htmlParser){
   try{
    return htmlParser.getReader();
   }catch(IOException e){
    e.printStackTrace();
   }
  }
  return null;
}
public String getPath(){
  return this.htmlPath;
}
}

A bean describing the structure of a search result
package com.rain.search;

public class SearchResultBean {
    private String htmlPath;

    private String htmlTitle;

public String getHtmlPath() {
return htmlPath;
}

public void setHtmlPath(String htmlPath) {
this.htmlPath = htmlPath;
}

public String getHtmlTitle() {
return htmlTitle;
}

public void setHtmlTitle(String htmlTitle) {
this.htmlTitle = htmlTitle;
}
}

Implementation of the indexing subsystem

package com.rain.index;

import java.io.File;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.document.Field;

import com.rain.util.HTMLDocParser;

public class IndexManager {

//the directory that stores HTML files
private final String dataDir="E:\\dataDir";

//the directory that is used to store a Lucene index
private final String indexDir="E:\\indexDir";

public boolean creatIndex()throws IOException{
  if(true==inIndexExist()){
   return true;
  }
  File dir=new File(dataDir);
  if(!dir.exists()){
   return false;
  }
  File[] htmls=dir.listFiles();
  Directory fsDirectory=FSDirectory.getDirectory(indexDir,true);
  Analyzer analyzer=new StandardAnalyzer();
  IndexWriter indexWriter=new IndexWriter(fsDirectory,analyzer,true);
  for(int i=0;i<htmls.length;i++){
   String htmlPath=htmls[i].getAbsolutePath();
   if(htmlPath.endsWith(".html")||htmlPath.endsWith(".htm")){
    addDocument(htmlPath,indexWriter);
   }
  }
  indexWriter.optimize();
  indexWriter.close();
  return true;
}

public void addDocument(String htmlPath,IndexWriter indexWriter){
  HTMLDocParser htmlParser=new HTMLDocParser(htmlPath);
  String path=htmlParser.getPath();
  String title=htmlParser.getTitle();
  Reader content=htmlParser.getContent();

  Document document=new Document();
  document.add(new Field("path",path,Field.Store.YES,Field.Index.NO));
  document.add(new Field("title",title,Field.Store.YES,Field.Index.TOKENIZED));
     document.add(new Field("content",content));
     try{
     indexWriter.addDocument(document);
     }catch(IOException e){
     e.printStackTrace();
     }
}
public String getDataDir(){
  return this.dataDir;
}

public String getIndexDir(){
  return this.indexDir;
}

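public boolean inIndexExist(){
  File directory=new File(indexDir);
  // listFiles() returns null when the directory does not exist
  File[] files=directory.listFiles();
  if(null!=files&&0<files.length){
   return true;
  }else{
   return false;
  }
}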
}

Implementation of the search function
package com.rain.search;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import com.rain.index.IndexManager;

public class SearchManager {
private String searchWord;
private IndexManager indexManager;
private Analyzer analyzer;

public SearchManager(String searchWord){
  this.searchWord=searchWord;
  this.indexManager=new IndexManager();
  this.analyzer=new StandardAnalyzer();
}

/**
     * do search
     */
public List search(){
  List searchResult=new ArrayList();
  if(false==indexManager.inIndexExist()){
   try{
    if(false==indexManager.creatIndex()){
     return searchResult;
    }
   }catch(IOException e){
    e.printStackTrace();
    return searchResult;
   }
  }
  IndexSearcher indexSearcher=null;
  try{
   indexSearcher=new IndexSearcher(indexManager.getIndexDir());
  }catch(IOException e){
   e.printStackTrace();
  }
  QueryParser queryParser=new QueryParser("content",analyzer);
  Query query=null;
  try{
   query=queryParser.parse(searchWord);
  }catch(ParseException e){
   e.printStackTrace();
  }
  if(null!=query&&null!=indexSearcher){
   try{
    Hits hits=indexSearcher.search(query);
    for(int i=0;i<hits.length();i++){
     SearchResultBean resultBean=new SearchResultBean();
     resultBean.setHtmlPath(hits.doc(i).get("path"));
     resultBean.setHtmlTitle(hits.doc(i).get("title"));
     searchResult.add(resultBean);
    }
   }catch(IOException e){
    e.printStackTrace();
   }
  }
   return searchResult;
}

}

Implementation of the request manager

package com.rain.servlet;

import java.io.IOException;
import java.util.List;

import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.rain.search.SearchManager;

/**
* @author zhourui
* 2007-1-28
*/
public class SearchController extends HttpServlet {
private static final long serialVersionUID=1L;

/* (non-Javadoc)
* @see javax.servlet.http.HttpServlet#doPost(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse)
*/
@Override
protected void doPost(HttpServletRequest arg0, HttpServletResponse arg1) throws ServletException, IOException {
  // TODO Auto-generated method stub
  String searchWord=arg0.getParameter("searchWord");
  SearchManager searchManager=new SearchManager(searchWord);
  List searchResult=null;
  searchResult=searchManager.search();
  RequestDispatcher dispatcher=arg0.getRequestDispatcher("search.jsp");
  arg0.setAttribute("searchResult",searchResult);
        dispatcher.forward(arg0, arg1);
}

}

Submitting a search request to the web server

SearchWord:

Displaying the search results

<%@ page import="java.util.List, com.rain.search.SearchResultBean" %>
<%
  List searchResult=(List)request.getAttribute("searchResult");
  int resultCount=0;
  if(null!=searchResult){
    resultCount=searchResult.size();
  }
  for(int i=0;i<resultCount;i++){
    SearchResultBean resultBean=(SearchResultBean)searchResult.get(i);
    String title=resultBean.getHtmlTitle();
    String path=resultBean.getHtmlPath();
%>
<%=title%>
<%
  }
%>

Eric.Zhou 2007-01-29 09:57

An introduction to basic Lucene usage

Eric.Zhou — Sun, 28 Jan 2007 02:38:00 GMT

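I. Overview

As a system accumulates more and more information, fishing the one needle you want out of that ocean becomes very important. Full-text search is the usual solution to this kind of problem, and Lucene is a tool for implementing it; any application can gain full-text search by embedding it.

II. Environment setup

Download the latest lucene.jar from lucene.apache.org and add the jar to the project's build path; Lucene can then be used directly in the project.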

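III. Usage

3.1. Basic concepts

This section covers the concepts you run into most often, explained by analogy with the database most readers already know. Full-text search with Lucene resembles this database workflow: table -> query the relevant fields or conditions -> return the matching records.

First comes IndexWriter, which builds the index, the counterpart of a database table. When building the index you must specify how it is built, that is, how the fields of each record are broken into terms. Lucene calls this the Analyzer, and it ships Analyzers for several situations, such as SimpleAnalyzer, StandardAnalyzer and GermanAnalyzer. StandardAnalyzer is the one used most often, because it can handle Chinese text.

Once the index exists, you insert the records to be indexed. Lucene calls such a record a Document, much like a row in a database table. The fields of a record are called Fields, essentially as in a database. Lucene classifies a Field by combinations of indexed or not indexed and tokenized or not tokenized. These elements are basically all you need to build an index.

Searching brings in a few more concepts. The first is Query. Lucene provides several commonly used Query types: TermQuery, MultiTermQuery, BooleanQuery, WildcardQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, FuzzyQuery, RangeQuery and SpanQuery. A Query specifies how the field in question is searched: fuzzy query, semantic query, phrase query, range query, combined query and so on. Then there is QueryParser, which can create the different kinds of Query, and MultiFieldQueryParser, which searches several fields for the same keyword. IndexSearcher determines which directory's index files are searched and how, somewhat like choosing which database table to query and how the fields of its records are broken down for the query. With an IndexSearcher and a Query you can retrieve the results you need. Lucene returns them as Hits; iterating over the Hits yields the matching Documents, and from each Document you can read the information in its Fields.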

Comparing Lucene with a database:

Lucene	Database
Data to index: doc(field1,field2...) doc(field1,field2...) -> indexer -> Lucene Index -> searcher -> Output: Hits(doc(field1,field2) doc(field1...))	Data to index: record(field1,field2...) record(field1..) -> SQL: insert -> DB Index -> SQL: select -> Output: results(record(field1,field2..) record(field1...))
Document: a "unit" to be indexed; a Document consists of multiple fields	Record: a record containing multiple fields
Field: a field	Field: a field
Hits: the result set of a query, made up of matching Documents	RecordSet: the result set of a query, made up of multiple Records

I hope this introduction to the basic concepts of building an index and running a full-text search gives you a working picture of Lucene.
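The index-then-search flow described above can be made concrete with a toy inverted index in plain Java. This is only a sketch of the concept, not how Lucene stores its index; TinyIndex and its methods are hypothetical names, and the "analyzer" here is just lower-casing and splitting on non-word characters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model of the concepts; Lucene's real index format is far more elaborate.
final class TinyIndex {
    // term -> ids of the documents that contain it (the "index table")
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // document id -> stored "path" field, returned with each hit
    private final List<String> storedPaths = new ArrayList<>();

    /** Plays the role of IndexWriter.addDocument: store the path, index the content. */
    void addDocument(String path, String content) {
        int id = storedPaths.size();
        storedPaths.add(path);
        for (String term : content.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Plays the role of a single-term search: returns the stored paths of matching documents. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int id : ids) {
                hits.add(storedPaths.get(id));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument("a.html", "Lucene is a search library");
        index.addDocument("b.html", "Databases use SQL");
        System.out.println(index.search("lucene"));
    }
}
```

The path plays the part of a stored but unindexed field, and the content plays the part of an indexed but unstored one, which mirrors the Field combinations described in this section.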

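Several interfaces you need to be familiar with:
Analyzer
        An analyzer's main job is filtering: a piece of text goes in, and what comes out is only the useful part, with the rest discarded. You can also write your own analyzer to suit your needs.
        org.apache.lucene.analysis.Analyzer: an abstract class; the two classes below both extend it.
        org.apache.lucene.analysis.SimpleAnalyzer: an analyzer that supports only the simplest Latin-script text.
        org.apache.lucene.analysis.standard.StandardAnalyzer: the standard analyzer. Besides Latin-script languages it also supports Asian languages, and it refines some of the matching behavior. This class has an important constructor, StandardAnalyzer(String[] stopWords), which lets you give the analyzer a list of stop words. That not only keeps useless terms out of the search, it also lets you define prohibited search keywords, such as politically sensitive or illegal ones.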
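The filtering role of an analyzer with a stop-word list can be sketched in a few lines of plain Java. This is not the StandardAnalyzer implementation, only an illustration of the idea; the class name and the example stop words are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration of stop-word filtering; not Lucene code.
final class StopWordFilter {

    /** Lower-cases the text, splits it into letter/digit runs and drops the stop words. */
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty() && !stopWords.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "the", "and"));
        System.out.println(analyze("The quick fox and the dog", stopWords));
    }
}
```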
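IndexWriter
        IndexWriter has three constructors, taking a Directory, a File, or a String file path respectively.
For example, IndexWriter(String path, Analyzer a, boolean create): path is the file path, a is the analyzer, and create indicates whether to rebuild the index (true: create a new index or overwrite an existing one; false: extend an existing index).
       Some important methods:

Method	Description
addDocument(Document doc)	Adds a document to the index
addIndexes(Directory[] dirs)	Adds the existing indexes in the given directories to this index
addIndexes(IndexReader[] readers)	Adds the given indexes to this index
optimize()	Merges and optimizes the index
close()	Closes the writer

To cut down on I/O maintenance work, IndexWriter creates a new small index file each time it has collected a certain number of documents (in my tests the smallest batch unit was 10) and then periodically merges these into a single index file. When indexing finishes you should therefore call writer.optimize() so that all the indexes are merged and optimized.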
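The batching behaviour described above can be modelled with a small simulation: documents are buffered, flushed into a new segment every N additions, and merged into one segment on optimize. This is a simplified model with hypothetical names, not Lucene's actual merge policy, which also merges segments incrementally while indexing.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of segment creation and merging; not IndexWriter's real algorithm.
final class SegmentedWriter {
    private final int flushThreshold;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> segments = new ArrayList<>();

    SegmentedWriter(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Buffers the document and writes a new small segment once the threshold is reached. */
    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() {
        if (!buffer.isEmpty()) {
            segments.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** Flushes any buffered documents and merges all segments into one. */
    void optimize() {
        flush();
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        segments.clear();
        if (!merged.isEmpty()) {
            segments.add(merged);
        }
    }

    int segmentCount() {
        return segments.size();
    }

    int bufferedCount() {
        return buffer.size();
    }

    public static void main(String[] args) {
        SegmentedWriter writer = new SegmentedWriter(10);
        for (int i = 0; i < 25; i++) {
            writer.addDocument("doc" + i);
        }
        System.out.println(writer.segmentCount() + " segments, " + writer.bufferedCount() + " buffered");
        writer.optimize();
        System.out.println(writer.segmentCount() + " segment after optimize");
    }
}
```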
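org.apache.lucene.document
Two main classes are introduced below:
a) org.apache.lucene.document.Document:
A Document is like a record in a database. It can consist of several fields (Field), and the fields can be of different types (see b for details). Document's main methods:

Method	Description
add(Field field)	Adds a field (Field) to the Document
String get(String name)	Returns the text of the named field from the document
Field getField(String name)	Returns the Field with the given name
Field[] getFields(String name)	Returns all Fields with the given name

b) org.apache.lucene.document.Field
        The "field" mentioned above; it is a section of a Document.
        Field's constructor:
       Field(String name, String string, boolean store, boolean index, boolean token).
        Indexed: if a field is Indexed, it can be searched.
        Stored: if a field is Stored, its value can be retrieved from the search results.
        Tokenized: if a field is Tokenized, it has been turned into a sequence of tokens by an Analyzer. During this tokenization the Analyzer extracts the text to be indexed and discards redundant words (for example a, the, they; see the API of org.apache.lucene.analysis.StopAnalyzer.ENGLISH_STOP_WORDS and org.apache.lucene.analysis.standard.StandardAnalyzer(String[] stopWords)). A token is the basic unit of indexing and represents one indexed word, for example one English word or one Chinese character. Therefore any text that contains Chinese must be Tokenized.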
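The statement that a token is either one English word or one Chinese character can be illustrated with a small tokenizer. This is a sketch of that behaviour in plain Java with a hypothetical class name, not the tokenizer Lucene uses internally.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: one token per Han character, one token per run of other letters/digits.
final class UnigramTokenizer {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                // A Chinese character ends any pending word and becomes its own token.
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
            } else {
                // Whitespace and punctuation separate tokens.
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lucene 全文检索 demo"));
    }
}
```

Splitting Chinese into single characters needs no dictionary, which is why the dedicated segmenters compared in the first post exist: they produce word-level tokens instead.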
     Field's factory methods:

Name	Stored	Indexed	Tokenized	use
Keyword(String name, String value)	Y	Y	N	date,url
Text(String name, Reader value)	N	Y	Y	short text fields: title,subject
Text(String name, String value)	Y	Y	Y	longer text fields, like “body”
UnIndexed(String name, String value)	Y	N	N
UnStored(String name, String value)	N	Y	Y

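Hits and Searcher
The main methods of Hits:

Method	Description
doc(int n)	Returns all the fields of the n-th document
length()	Returns the number of hits available in this result set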

3.2. Implementing full-text search requirements

Code for building the index:

private void createIndex(String indexFilePath) throws Exception{
    IndexWriter iwriter=getWriter(indexFilePath);

    // first document; the title means "a title used for indexing",
    // the content ("content that is not indexed") is stored only
    Document doc=new Document();
    doc.add(Field.Keyword("name","jerry"));
    doc.add(Field.Text("sender","bluedavy@gmail.com"));
    doc.add(Field.Text("receiver","google@gmail.com"));
    doc.add(Field.Text("title","用于索引的标题"));
    doc.add(Field.UnIndexed("content","不建立索引的内容"));

    // second document; here the content ("content that is indexed") is searchable
    Document doc2=new Document();
    doc2.add(Field.Keyword("name","jerry.lin"));
    doc2.add(Field.Text("sender","bluedavy@hotmail.com"));
    doc2.add(Field.Text("receiver","msn@hotmail.com"));
    doc2.add(Field.Text("title","用于索引的第二个标题"));
    doc2.add(Field.Text("content","建立索引的内容"));

    iwriter.addDocument(doc);
    iwriter.addDocument(doc2);
    iwriter.optimize();
    iwriter.close();
}

private IndexWriter getWriter(String indexFilePath) throws Exception{
    // create a new index only if no "segments" file exists yet; otherwise extend the existing one
    boolean create=true;
    File file=new File(indexFilePath+File.separator+"segments");
    if(file.exists())
        create=false;
    // analyzer is assumed to be a field of the enclosing class, e.g. a StandardAnalyzer
    return new IndexWriter(indexFilePath,analyzer,create);
}

3.2.1. Fuzzy (wildcard) query on a keyword in one field

// match every sender term that contains "davy"
Query query=new WildcardQuery(new Term("sender","*davy*"));
Searcher searcher=new IndexSearcher(indexFilePath);
Hits hits=searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("name"));
}
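How a pattern such as *davy* selects terms can be shown with a small matcher that translates the wildcard into a regular expression, where * stands for any run of characters and ? for exactly one. This is a sketch of the matching rule only, with hypothetical names; Lucene's WildcardQuery enumerates index terms and does not work through java.util.regex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

// Illustration of wildcard semantics; not how WildcardQuery is implemented.
final class WildcardMatcher {

    /** Translates a wildcard pattern into an equivalent regular expression. */
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote everything else so that characters like '.' match literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns the terms that match the whole wildcard pattern. */
    static List<String> matchingTerms(String wildcard, Collection<String> terms) {
        Pattern pattern = compile(wildcard);
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> senders = Arrays.asList("bluedavy@gmail.com", "google@gmail.com");
        System.out.println(matchingTerms("*davy*", senders));
    }
}
```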

3.2.2. Semantic (analyzer-parsed) query on a keyword in one field

// the keyword "索引" ("index") is run through the analyzer before the title field is searched
Query query=QueryParser.parse("索引","title",analyzer);
Searcher searcher=new IndexSearcher(indexFilePath);
Hits hits=searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("name"));
}

3.2.3. Keyword query across multiple fields

// search both title and content for the same keyword
Query query=MultiFieldQueryParser.parse("索引",new String[]{"title","content"},analyzer);
Searcher searcher=new IndexSearcher(indexFilePath);
Hits hits=searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("name"));
}

3.2.4. Compound query (several query conditions combined)

Query query=MultiFieldQueryParser.parse("索引",new String[]{"title","content"},analyzer);
Query mquery=new WildcardQuery(new Term("sender","bluedavy*"));
TermQuery tquery=new TermQuery(new Term("name","jerry"));

// add(query, required, prohibited): all three clauses are required here
BooleanQuery bquery=new BooleanQuery();
bquery.add(query,true,false);
bquery.add(mquery,true,false);
bquery.add(tquery,true,false);

Searcher searcher=new IndexSearcher(indexFilePath);
Hits hits=searcher.search(bquery);
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("name"));
}
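The effect of the required and prohibited flags can be sketched as set operations over the document ids matched by each clause. The class below is a hypothetical model for illustration, not Lucene's BooleanQuery, and it ignores scoring entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Model of boolean clause semantics over sets of document ids; scoring is not modelled.
final class BooleanClauses {
    private final List<Set<Integer>> required = new ArrayList<>();
    private final List<Set<Integer>> prohibited = new ArrayList<>();
    private final List<Set<Integer>> optional = new ArrayList<>();

    /** Mirrors the add(query, required, prohibited) call, taking the ids the sub-query matched. */
    void add(Set<Integer> matches, boolean isRequired, boolean isProhibited) {
        if (isRequired && isProhibited) {
            throw new IllegalArgumentException("a clause cannot be both required and prohibited");
        }
        if (isRequired) {
            required.add(matches);
        } else if (isProhibited) {
            prohibited.add(matches);
        } else {
            optional.add(matches);
        }
    }

    /** Required clauses intersect, optional clauses union (only if nothing is required), prohibited subtract. */
    SortedSet<Integer> evaluate() {
        SortedSet<Integer> result = new TreeSet<>();
        if (!required.isEmpty()) {
            result.addAll(required.get(0));
            for (Set<Integer> clause : required) {
                result.retainAll(clause);
            }
        } else {
            for (Set<Integer> clause : optional) {
                result.addAll(clause);
            }
        }
        for (Set<Integer> clause : prohibited) {
            result.removeAll(clause);
        }
        return result;
    }

    public static void main(String[] args) {
        // Document 0 is "jerry", document 1 is "jerry.lin" from the indexing example.
        BooleanClauses query = new BooleanClauses();
        query.add(new HashSet<>(Arrays.asList(0, 1)), true, false); // title/content keyword
        query.add(new HashSet<>(Arrays.asList(0, 1)), true, false); // sender bluedavy*
        query.add(new HashSet<>(Arrays.asList(0)), true, false);    // name equals jerry
        System.out.println(query.evaluate());
    }
}
```

With all three clauses required, only the document that satisfies every condition survives, which matches what the compound query above is meant to return.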

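IV. Summary

I trust the explanation above gives you a basic grasp of how to use Lucene. For full-text search I suggest starting with semantic searches to find the meaningful content first, and only then running fuzzy-style searches ^_^, although in the end this depends on your search requirements. Lucene also offers many other useful features, which I leave for you to explore as you use it: mastering the Query classes that Lucene itself provides, using Filter and Sort, writing your own Analyzer, implementing your own Query, and so on. You could even go on to study search-engine techniques (word segmentation, index ranking, etc.).

Eric.Zhou 2007-01-28 10:38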