1. Lucene's structural framework:
Note: some of the more complex lexical analysis in Lucene is generated with JavaCC (JavaCC: Java Compiler Compiler, a pure-Java lexer/parser generator). So if you compile from source, or need to modify the QueryParser or build a custom lexical analyzer, you also need to download javacc from https://javacc.dev.java.net/ .
Lucene's package structure: for an external application, the index module (index) and the search module (search) are the main entry points.
org.apache.lucene.search/ search entry point
org.apache.lucene.index/ indexing entry point
org.apache.lucene.analysis/ language analyzers
org.apache.lucene.queryParser/ query parser
org.apache.lucene.document/ document/storage structure
org.apache.lucene.store/ low-level IO and storage
org.apache.lucene.util/ shared utility data structures
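Since index and search are the two main entry points, a minimal usage sketch may help. It is written against the Lucene 1.9/2.x-era API that the rest of this article uses (IndexWriter, Hits, Field.Index.TOKENIZED); the "demo-index" directory and the "content" field name are purely illustrative, and newer Lucene versions have replaced several of these calls.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class IndexAndSearchDemo {
  public static void main(String[] args) throws Exception {
    // Build a tiny index in the local "demo-index" directory (illustrative path).
    IndexWriter writer = new IndexWriter("demo-index", new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("content", "Lucene is a full-text search library",
                      Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();

    // Search the index we just built.
    IndexSearcher searcher = new IndexSearcher("demo-index");
    Query query = new QueryParser("content", new StandardAnalyzer()).parse("search");
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      System.out.println(hits.doc(i).get("content"));
    }
    searcher.close();
  }
}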
 
2. On the differences between dictionary-based segmentation, unigram segmentation, and bigram segmentation. Note that noise.chs serves as the stop-word list within the dictionary. A detailed description follows:
 
 
Lucene is used more and more widely, and when indexing Chinese text, Chinese word segmentation becomes an increasingly important problem.
 
Among the existing segmentation approaches, the most common and most general-purpose are unigram segmentation, bigram segmentation, and dictionary-based segmentation. A Java implementation of unigram segmentation was written by yysun and has been accepted into Apache. Its approach is simple: every Chinese character becomes one token. For example, "这是中文字" is split by unigram segmentation into five tokens: 这, 是, 中, 文, 字. Bigram segmentation instead takes every pair of adjacent characters as one token; for the same input "这是中文字", bigram segmentation yields: 这是, 是中, 中文, 文字.
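To make the two modes concrete, here is a minimal sketch (not tied to any particular library; the class and method names are just illustrative) that reproduces exactly the token lists described above.

import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
  // Unigram mode: each character is one token.
  static List<String> unigrams(String text) {
    List<String> tokens = new ArrayList<String>();
    for (int i = 0; i < text.length(); i++) {
      tokens.add(text.substring(i, i + 1));
    }
    return tokens;
  }

  // Bigram mode: every two adjacent characters form one token.
  static List<String> bigrams(String text) {
    List<String> tokens = new ArrayList<String>();
    for (int i = 0; i + 1 < text.length(); i++) {
      tokens.add(text.substring(i, i + 2));
    }
    return tokens;
  }

  public static void main(String[] args) {
    String text = "这是中文字";
    System.out.println(unigrams(text)); // [这, 是, 中, 文, 字]
    System.out.println(bigrams(text));  // [这是, 是中, 中文, 文字]
  }
}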
 
Unigram and bigram segmentation are simple to implement and work for essentially all East Asian languages, but both have obvious weaknesses. Unigram segmentation looks only at individual characters and ignores word structure: in the example above, the two obvious words "中文" and "文字" are never recognized as words. Bigram segmentation, in contrast, produces far too many redundant tokens: meaningless combinations such as "这是" and "是中" are treated as words, and even the tokens it does hit are not always the right ones. In "这是中文字", the word "中文字" should take priority, yet bigram segmentation cannot produce it.
 
Dictionary-based segmentation is much harder to implement, and it comes in many variants: Microsoft's Chinese segmentation inside its own software, the research edition of HyLanda's (海量) Chinese segmenter, the Lietu (猎兔) segmenter that is widely used on .NET, and various segmenters written by individual developers. Each has its own analysis framework. Accuracy is high, but the implementation is difficult and the development cycle long, and for typical small and medium-sized systems whose accuracy requirements are not strict, the cost to the system is a luxury.
 
After weighing unigram, bigram, and dictionary-based segmentation, I venture to propose a segmentation mode based on stop-word splitting. The idea is: take the paragraph to be segmented and first split it at punctuation marks into ordinary short sentences; then split each short sentence as far as possible at the configured stop words, yielding individual words. For example, for the input "这是中文字" with the stop-word list {"这", "是"}, the final result is "中文字".
 
That example is rather trivial, so here is a slightly longer one: for the input "中文软件需要具有对中文文本的输入、显示、编辑、输出等基本功能" and the stop-word list {"这", "是", "的", "对", "等", "需要", "具有"}, the resulting word list is as follows (a minimal code sketch of this splitting appears right after the list):
 
====================  
中文软件  
中文文本  
输入  
显示  
编辑  
输出  
基本功能  
==================== 
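A minimal sketch of this stop-word splitting idea, under the assumption of a hand-written helper (the class and method names are illustrative); punctuation such as 、 is treated as a separator just like a stop word, so the output matches the list above.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSplitDemo {
  private static final String PUNCTUATION = "、,。,.;;!!??";

  // Split a sentence into words by treating stop words (and punctuation) as separators.
  static List<String> splitByStopWords(String sentence, Set<String> stopWords) {
    List<String> words = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    int i = 0;
    while (i < sentence.length()) {
      // Greedily match the longest stop word starting at position i.
      String matched = null;
      for (String stop : stopWords) {
        if (sentence.startsWith(stop, i) && (matched == null || stop.length() > matched.length())) {
          matched = stop;
        }
      }
      boolean isPunct = PUNCTUATION.indexOf(sentence.charAt(i)) >= 0;
      if (matched != null || isPunct) {
        if (current.length() > 0) {
          words.add(current.toString());
          current.setLength(0);
        }
        i += (matched != null) ? matched.length() : 1;
      } else {
        current.append(sentence.charAt(i));
        i++;
      }
    }
    if (current.length() > 0) {
      words.add(current.toString());
    }
    return words;
  }

  public static void main(String[] args) {
    Set<String> stopWords = new HashSet<String>(
        Arrays.asList("这", "是", "的", "对", "等", "需要", "具有"));
    String sentence = "中文软件需要具有对中文文本的输入、显示、编辑、输出等基本功能";
    System.out.println(splitByStopWords(sentence, stopWords));
    // Prints: [中文软件, 中文文本, 输入, 显示, 编辑, 输出, 基本功能]
  }
}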
 
That is basically the desired result, but it is not without flaws: in the output above, "中文软件" and "中文文本" should really be split further into the three independent words "中文", "软件", and "文本" rather than kept whole.
 
Moreover, configuring the stop-word list is itself a fairly tricky step, since there is no fixed rule for choosing stop words. My thought is to use meaningless subjects such as "我", "你", "他", "我们", "他们", verbs such as "是", "对", "有", and various other function words such as "的", "啊", "一", "不", "在", "人" as stop words (the contents of the noise.chs file in the System32 directory can serve as a reference).
 
 
3. On segmentation, this thread is also worth watching:
http://lucene-group.group.javaeye.com/group/blog/58701
a dictionary-based Lucene analyzer the author wrote himself: ThesaurusAnalyzer.
 
I have tested it myself; it works reasonably well, with a dictionary of roughly 180,000 entries.
 
4. Tests of the analyzers usable with Lucene are shown below:
 
Lucene itself ships with several analyzers, and I later wrote an additional one myself.
 
In order of increasing functionality:
 
WhitespaceAnalyzer: only splits on whitespace; does not lowercase the tokens; no Chinese support.
 
SimpleAnalyzer: stronger than WhitespaceAnalyzer; filters out every non-letter character and lowercases all tokens; no Chinese support.
 
StopAnalyzer: goes beyond SimpleAnalyzer; on top of SimpleAnalyzer it also removes stop words; no Chinese support.
 
StandardAnalyzer: handles English the same way as StopAnalyzer; supports Chinese by splitting it into single characters.
 
ChineseAnalyzer: from the Lucene sandbox; behaves much like StandardAnalyzer, but its drawback is that it cannot segment mixed Chinese and English text.
 
CJKAnalyzer: written by chedong; handles English the same way as StandardAnalyzer, but for Chinese it uses bigram splitting and cannot filter out punctuation.
 
TjuChineseAnalyzer: written by me, and the most capable of the group. For Chinese segmentation it calls the Java interface of ICTCLAS, so its Chinese behavior matches that of ICTCLAS. For English it builds on Lucene's StopAnalyzer, so it removes stop words, ignores case, and filters out punctuation of all kinds.
 
The program was developed and debugged in JBuilder 2005.
 
package org.apache.lucene.analysis; 
 
//Author: zhangbufeng
//TjuAILab (Tianjin University Artificial Intelligence Lab)
//2005.9.22 11:00
 
import java.io.*;  
import junit.framework.*; 
 
import org.apache.lucene.*;  
import org.apache.lucene.analysis.*;  
import org.apache.lucene.analysis.StopAnalyzer;  
import org.apache.lucene.analysis.standard.*;  
import org.apache.lucene.analysis.cn.*;  
import org.apache.lucene.analysis.cjk.*;  
import org.apache.lucene.analysis.tjucn.*;  
import com.xjt.nlp.word.*;  
public class TestAnalyzers extends TestCase { 
 
public TestAnalyzers(String name) {  
super(name);  
} 
 
public void assertAnalyzesTo(Analyzer a,
                             String input,
                             String[] output) throws Exception {
  // The field name "dummy" does not appear to be used by the analyzers here.
  TokenStream ts = a.tokenStream("dummy", new StringReader(input));
  for (int i = 0; i < output.length; i++) {
    Token t = ts.next();
    //System.out.println(t);
    assertNotNull(t);
    // Print each token's text, separated by spaces.
    System.out.print(t.termText());
    System.out.print(" ");
    assertEquals(output[i], t.termText());
  }
  System.out.println(" ");
  assertNull(ts.next());
  ts.close();
}
 public void outputAnalyzer(Analyzer a, String input) throws Exception {
   TokenStream ts = a.tokenStream("dummy", new StringReader(input));
   // Print every token produced by the analyzer, separated by spaces.
   while (true) {
     Token t = ts.next();
     if (t == null) {
       break;
     }
     System.out.print(t.termText());
     System.out.print(" ");
   }
   System.out.println(" ");
   ts.close();
 }
 
 public void testSimpleAnalyzer() throws Exception {
 //Exercise SimpleAnalyzer:
 //SimpleAnalyzer filters out every non-letter character and lowercases all tokens.
 Analyzer a = new SimpleAnalyzer();
 assertAnalyzesTo(a, "foo bar FOO BAR",
 new String[] { "foo", "bar", "foo", "bar" });
 assertAnalyzesTo(a, "foo bar . FOO <> BAR",
 new String[] { "foo", "bar", "foo", "bar" });
 assertAnalyzesTo(a, "foo.bar.FOO.BAR",
 new String[] { "foo", "bar", "foo", "bar" });
 assertAnalyzesTo(a, "U.S.A.",
 new String[] { "u", "s", "a" });
 assertAnalyzesTo(a, "C++",
 new String[] { "c" });
 assertAnalyzesTo(a, "B2B",
 new String[] { "b", "b" });
 assertAnalyzesTo(a, "2B",
 new String[] { "b" });
 assertAnalyzesTo(a, "\"QUOTED\" word",
 new String[] { "quoted", "word" });
 assertAnalyzesTo(a,"zhang ./ bu <> feng",
 new String[]{"zhang","bu","feng"});
 ICTCLAS splitWord = new ICTCLAS();
 String result = splitWord.paragraphProcess("我爱大家 i LOVE chanchan");
 assertAnalyzesTo(a,result,
 new String[]{"我","爱","大家","i","love","chanchan"});
 
 }
 
 public void testWhiteSpaceAnalyzer() throws Exception {
 //WhitespaceAnalyzer only splits on whitespace and does not lowercase the tokens.
 Analyzer a = new WhitespaceAnalyzer();
 assertAnalyzesTo(a, "foo bar FOO BAR",
 new String[] { "foo", "bar", "FOO", "BAR" });
 assertAnalyzesTo(a, "foo bar . FOO <> BAR",
 new String[] { "foo", "bar", ".", "FOO", "<>", "BAR" });
 assertAnalyzesTo(a, "foo.bar.FOO.BAR",
 new String[] { "foo.bar.FOO.BAR" });
 assertAnalyzesTo(a, "U.S.A.",
 new String[] { "U.S.A." });
 assertAnalyzesTo(a, "C++",
 new String[] { "C++" });
 
 assertAnalyzesTo(a, "B2B",
 new String[] { "B2B" });
 assertAnalyzesTo(a, "2B",
 new String[] { "2B" });
 assertAnalyzesTo(a, "\"QUOTED\" word",
 new String[] { "\"QUOTED\"", "word" });
 
 assertAnalyzesTo(a,"zhang bu feng",
 new String []{"zhang","bu","feng"});
 ICTCLAS splitWord = new ICTCLAS();
 String result = splitWord.paragraphProcess("我爱大家 i love chanchan");
 assertAnalyzesTo(a,result,
 new String[]{"我","爱","大家","i","love","chanchan"});
 }
 
 public void testStopAnalyzer() throws Exception {
 //StopAnalyzer goes beyond SimpleAnalyzer:
 //on top of SimpleAnalyzer it also removes stop words.
 Analyzer a = new StopAnalyzer();
 assertAnalyzesTo(a, "foo bar FOO BAR",
 new String[] { "foo", "bar", "foo", "bar" });
 assertAnalyzesTo(a, "foo a bar such FOO THESE BAR",
 new String[] { "foo", "bar", "foo", "bar" });
 assertAnalyzesTo(a,"foo ./ a bar such ,./<> FOO THESE BAR ",
 new String[]{"foo","bar","foo","bar"});
 ICTCLAS splitWord = new ICTCLAS();
 String result = splitWord.paragraphProcess("我爱大家 i Love chanchan such");
 assertAnalyzesTo(a,result,
 new String[]{"我","爱","大家","i","love","chanchan"});
 
 }
 public void testStandardAnalyzer() throws Exception{
 //StandardAnalyzer is the most capable built-in analyzer; Chinese is split into single characters.
 Analyzer a = new StandardAnalyzer();
 assertAnalyzesTo(a,"foo bar Foo Bar",
 new String[]{"foo","bar","foo","bar"});
 assertAnalyzesTo(a,"foo bar ./ Foo ./ BAR",
 new String[]{"foo","bar","foo","bar"});
 assertAnalyzesTo(a,"foo ./ a bar such ,./<> FOO THESE BAR ",
 new String[]{"foo","bar","foo","bar"});
 assertAnalyzesTo(a,"张步峰是天大学生",
 new String[]{"张","步","峰","是","天","大","学","生"});
 //Verify that English punctuation is removed
 assertAnalyzesTo(a,"张,/步/,峰,.是.,天大<>学生",
 new String[]{"张","步","峰","是","天","大","学","生"});
 //Verify that Chinese punctuation is removed
 assertAnalyzesTo(a,"张。、步。、峰是。天大。学生",
 new String[]{"张","步","峰","是","天","大","学","生"});
 }
 public void testChineseAnalyzer() throws Exception{
 //ChineseAnalyzer behaves much like StandardAnalyzer, but is probably slower.
 Analyzer a = new ChineseAnalyzer();
 
 //Whitespace is removed
 assertAnalyzesTo(a,"foo bar Foo Bar",
 new String[]{"foo","bar","foo","bar"});
 assertAnalyzesTo(a,"foo bar ./ Foo ./ BAR",
 new String[]{"foo","bar","foo","bar"});
 assertAnalyzesTo(a,"foo ./ a bar such ,./<> FOO THESE BAR ",
 new String[]{"foo","bar","foo","bar"});
 assertAnalyzesTo(a,"张步峰是天大学生",
 new String[]{"张","步","峰","是","天","大","学","生"});
 //Verify that English punctuation is removed
 assertAnalyzesTo(a,"张,/步/,峰,.是.,天大<>学生",
 new String[]{"张","步","峰","是","天","大","学","生"});
 //Verify that Chinese punctuation is removed
 assertAnalyzesTo(a,"张。、步。、峰是。天大。学生",
 new String[]{"张","步","峰","是","天","大","学","生"});
 //Mixed Chinese and English text is not supported:
 // assertAnalyzesTo(a,"我爱你 i love chanchan",
 // new String[]{"我","爱","你","i","love","chanchan"});
 
 }
 public void testCJKAnalyzer() throws Exception {
 //chedong's CJKAnalyzer handles English the same way as StandardAnalyzer,
 //but for Chinese it uses bigram splitting and cannot filter out punctuation.
 Analyzer a = new CJKAnalyzer();
 assertAnalyzesTo(a,"foo bar Foo Bar",
 new String[]{"foo","bar","foo","bar"});
 assertAnalyzesTo(a,"foo bar ./ Foo ./ BAR",
 new String[]{"foo","bar","foo","bar"});
 assertAnalyzesTo(a,"foo ./ a bar such ,./<> FOO THESE BAR ",
 new String[]{"foo","bar","foo","bar"});
 
 // assertAnalyzesTo(a,"张,/步/,峰,.是.,天大<>学生",
 // new String[]{"张步","步峰","峰是","是天","天大","大学","学生"});
 //assertAnalyzesTo(a,"张。、步。、峰是。天大。学生",
 // new String[]{"张步","步峰","峰是","是天","天大","大学","学生"});
 //Mixed Chinese and English text is supported:
 assertAnalyzesTo(a,"张步峰是天大学生 i love",
 new String[]{"张步","步峰","峰是","是天","天大","大学","学生","i","love"});
 
 }
 public void testTjuChineseAnalyzer() throws Exception{
 /**
  * TjuChineseAnalyzer is the most capable analyzer here. For Chinese segmentation it calls
  * the Java interface of ICTCLAS, so its Chinese behavior matches that of ICTCLAS. For English
  * it builds on Lucene's StopAnalyzer: it removes stop words, ignores case, and filters out
  * punctuation of all kinds.
  */
 Analyzer a = new TjuChineseAnalyzer();
 String input = "体育讯 在被尤文淘汰之后,皇马主帅博斯克拒绝接受媒体对球队后防线的批评,同时还为自己排出的首发阵容进行了辩护。"+
 "“失利是全队的责任,而不仅仅是后防线该受指责,”博斯克说,“我并不认为我们踢得一塌糊涂。”“我们进入了半决赛,而且在晋级的道路上一路奋 "+
 "战。即使是今天的比赛我们也有几个翻身的机会,但我们面对的对手非常强大,他们踢得非常好。”“我们的球迷应该为过去几个赛季里我们在冠军杯中的表现感到骄傲。”"+
 "博斯克还说。对于博斯克在首发中排出了久疏战阵的坎比亚索,赛后有记者提出了质疑,认为完全应该将队内的另一 "+
 "名球员帕文派遣上场以加强后卫线。对于这一疑议,博斯克拒绝承担所谓的“责任”,认为球队的首发没有问题。“我们按照整个赛季以来的方式做了,"+
 "对于人员上的变化我没有什么可说的。”对于球队在本赛季的前景,博斯克表示皇马还有西甲联赛的冠军作为目标。“皇家马德里在冠军 "+
 "杯中战斗到了最后,我们在联赛中也将这么做。”"+
 "A Java User Group is a group of people who share a common interest in
Java technology and meet on a regular basis to share"+
 " technical ideas and information. The actual structure of a JUG can
vary greatly - from a small number of friends and coworkers"+
 " meeting informally in the evening, to a large group of companies based in the same geographic area. "+
 "Regardless of the size and focus of a particular JUG, the sense of community spirit remains the same. ";
 
 outputAnalyzer(a,input);
 //I have also tested this method on large texts; it works fine and gives good results.
 outputAnalyzer(a,"我爱大家 ,,。 I love China 我喜欢唱歌 ");
 assertAnalyzesTo(a,"我爱大家 ,,。I love China 我喜欢唱歌",
 new String[]{"爱","大家","i","love","china","喜欢","唱歌"});
 }
 } 