2008年1月26日随笔档案 - tim-wu

反向索引：

正向索引（草稿，不完全，因为收到field info的影响，不同的field存储内容不同，且fieldInfo的有些信息,TOKENIZED BINARY COMPRESSED也是保存在.fdt的每个document相关段的bits中,而不是.fnm中）:

posted @ 2008-02-27 18:14 鹏飞万里阅读(1601) | 评论 (0) | 编辑收藏

Lucene和GCJ

Lucene在1.9版本的时候就已经加入了对GCJ的支持，利用GCJ编译Lucene，并且使用新的GCJIndexInput.java读写文件系统，
直接调用操作系统级别的native方法，相信读写性能能够极大得提高啊。

具体代码可见Lucene的gcj目录，编译使用ant gcj

posted @ 2008-02-14 15:27 鹏飞万里阅读(464) | 评论 (0) | 编辑收藏

备忘：lucene中的ranking算法

说明见Similarity.java的javadoc信息：

算法请参考javadoc的，它使用的是Vector Space Model (VSM) of Information Retrieval。

针对一条查询语句q(query)，一个d(document)的得分公式

score(q,d) = coord(q,d) · queryNorm(q) ·	∑	( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
	t in q

其中，

tf(t in d) 表示某个term的出现频率，定义了term t出现在当前地document d的次数。那些query中给定地term，如果出现越多次的，得分越高。它在默认实现DefaultSimilarity的公式为

tf(t in d) = frequency^½

idf(t) 表示反向文档频率。这个参数表示docFreq(term t一共在多少个文档中出现)的反向影响值。它意味着在越少文档中出现的terms贡献越高地分数。它在默认实现DefaultSimilarity的公式为:

idf(t) =

1 + log (

numDocs

–––––––––

docFreq+1

)

coord(q,d) 是一个基于在该文档中出现了多少个query中的terms的得分因素。文档中出现的query中的terms数量/query总共多少个query数量。典型的，一个文档包含越多地query中的terms会得到更高地分。This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.
queryNorm(q) 是一个标准化参数，它是用来区分比较不同queries时的因素，这个因素不影响document ranking (因为所有的ranked document都会乘以相同的值)，但是不同地queries（或这不同地indexes中）它会得到不同的可用于比较的值。This is a search time factor computed by the Similarity in effect at search time. 它在默认实现DefaultSimilarity的公式为:

queryNorm(q) = queryNorm(sumOfSquaredWeights) =

––––––––––––––

sumOfSquaredWeights^½

其中的sumOfSquaredWeights(of the query terms)是根据the query Weight object计算出来的. For example, a boolean query computes this value as:

`sumOfSquaredWeights` = `q.getBoost()` ² ·	∑	( idf(t) · t.getBoost() ) ²
	t in q

t.getBoost() 是一个term t在query q中的search time boost，它是在the query text (see query syntax)中指定的, 或者被应用程序直接调用setBoost()设置的. 注意，这儿没有直接的API去访问在 a multi term query的一个term的boost值，但是multi terms会以multi TermQuery objects在一个query中被表示,因此the boost of a term in the query可以使用子query的getBoost()反问到.

norm(t,d) 封装(encapsulates)了一些(indexing time)的boost和length factors: ???这个参数之和field中tokens的数量有关，和term本身无关???

Document boost - set by calling doc.setBoost() before adding the document to the index.

Field boost - set by calling field.setBoost() before adding the field to a document.

lengthNorm(field) -。当文档被加入到索引时计算，，和document的field中的tokens的数量有关，因此，一个比较短的fields贡献更高的分数。LengthNorm is computed by the Similarity class in effect at indexing. DefaultSimilarity中的实现为(float)(1.0 / Math.sqrt(numTerms));

当一个文档被加入索引时，上述因素会被相乘。如果文档有多个fields同名，他们的boosts数值会被多次相乘。

norm(t,d) = `doc.getBoost()` · `lengthNorm(field)` ·	∏	`f.getBoost`()
	field f in d named as t

但是，计算出的norm数值在存储时是使用一个a single byte编码的。search时，这个norm byte从index directory读取，并且被解码回float。这个编码/解码算法会产生精度丢失。 - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Also notice that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.

posted @ 2008-02-09 17:58 鹏飞万里阅读(1879) | 评论 (0) | 编辑收藏

Lucene如何控制segments的数量?

Lucene的索引文件，会包含很多个segments文件，每个segment中包含多个documents文件，一个segment中会有完整的正向索引和反向索引。
在搜索时，Lucene会遍历这些segments，以segments为基本单位独立搜索每个segments文件，而后再把搜索结果合并。

建立索引文件的过程，实际就是把documents文件一个个加入索引中，Lucene的做法是最开始为每个新加入的document独立生成一个segment，放在内存中。而后，当内存中segments数量到达一个阙值时，合并这些segments，新生成一个segment加入文件系统的segments列表中。
而当文件系统的segments数量过多时，势必影响搜索效率，因此需要不断合并这些segments文件。

那么Lucene的合并策略是什么？如何保证合适的segments数量呢？

其实Lucene有两套基本的策略：
第一种策略实现代码位于IndexWriter#optimize()函数，其实就是把所有segments文件合并成一个文件。合并的算法是递归合并列表最后的mergeFactor个segments文件直到合并成一个文件。这儿mergeFactor是Lucene的一个参数。
btw: 从算法细节上看，其实我不是喜欢这段实现，因为列表的最后mergeFactor个文件内容实际被扫描了segmens_count/mergeFactor次。如果分段归纳合并的方式不知道是否更好，每个segment文件内容都将被扫描 ceil(Log_mergeFactor(segmens_count)) 或ceil(Log_mergeFactor(segmens_count)) +1次，是否更好?

第二种策略实现代码位于IndexWriter#maybeMergeSegments()函数中，这个代码就复杂了，它的基本策略就是要求确保合并后两个公式的成立:
假定segments是个有序列表，B表示maxBufferedDocs，f(n)=ceil(log_M(ceil(n/B)))，M表示mergeFactor，这儿maxBufferedDocs和mergeFactor是两个参数
1. 如果第i个segment的documents数量为x，第i+1个segment的documents数量为y，那么f(x)>f(y)一定成立
2. f(n)值相同的segments的数量不得超过M。
那么maybeMergeSegments()函数是如何确保这两个公式成立的呢?
1.首先，从代码：

    protected final void maybeFlushRamSegments() throws IOException {
        // A flush is triggered if enough new documents are buffered or
        // if enough delete terms are buffered
        if (ramSegmentInfos.size() >= minMergeDocs
                || numBufferedDeleteTerms >= maxBufferedDeleteTerms) {
            flushRamSegments();
        }
    }

这儿minMergeDocs=maxBufferedDocs, 因此可以看出，当内存中缓存的segments被合并写回磁盘时生成的segment的document count等于或小于maxBufferedDocs（即minMergeDocs）。
补充：因为maybeMergeSegments()运行在同步代码中，因此只要ramSegmentInfos.size==minMergerDocs(即maxBufferedDocs)就会写回磁盘，因此应该不存在ramSegmentInfos.size>maxBufferedDocs才写回的情况。而且，但如果是这种情况，因为合并后的segment文件的document count过大，后面的maybeMergeSegments()将不做合并处理直接退出，上述公式就可能不成立，那么算法将有错。
----
2.
2.1 因此maybeMergeSegments()第一次执行时，所有segments的document count都小于等于maxBufferedDocs。此时，从i=0开始，合并i~i+mergeFactor-1个文件，如果合并后的doc count>maxBufferedDocs时，保留第i个segment，否则继续合并改变后的i~i+mergeFactor-1，直到doc count>maxBufferedDocs或所有segments文件个数已经<mergeFactor了。经过这样一轮的合并，除了最后<mergeFactor个的doc counts<=maxBufferedDocs文件外，其它文件的doc counts一定都>maxBufferedDocs并<maxBufferedDocs*mergeFactor了。
2.2 这时，不理会最后<mergeFactor个doc count<maxBufferedDocs的文件，而后按2.1的同理规则，合并之前的文件，让这些文件的最后<mergerFactor个segment符合 maxBufferedDocs<doc counts<=maxBufferedDocs*mergeFactor，之前的segment文件都符合maxBufferedDocs*mergeFactor<doc counts<=maxBufferedDocs*mergeFactor^2
2.3 重复2.2，最后得到的列表就会满足上述两等式的成立
---
3
之后，每次从内存缓存中flush出一个新的segment时，也就是往这个segments列表的最后增加一个doc_count<=maxBufferedDocs的文件，同样执行上述步骤2进行合并，能够始终保证上述两公式的成立。
----
4
4.1
IndexWriter#addIndexesNoOptimize同样借鉴使用了maybeMergeSegments()算法，区别此时，实际是已有一个符合两公式的segments序列T，在T之后追加上随意顺序的segments序列S。这时，我们先找到S中doc count值最大的那个segment，计算出它属于的区间f(x)，使得maxBufferedDocs*mergeFactor^x<doc counts<=maxBufferedDocs*mergeFactor^(x+1)，而后按2.2的算法合并出除了最后<mergerFactor个segments外，之前所有segments都符合 a. doc count>maxBufferedDocs*mergeFactor^(x+1) b.符合上述2等式。
btw: 因为这儿调用IndexWriter#addIndexesNoOptimize传入的参数是maxBufferedDocs*mergeFactor^(x+1)，因为S所有segment的doc count都一定小于maxBufferedDocs*mergeFactor^(x+1)，因此S的所有元素都会参与收缩合并。
4.2 将最后<mergerFactor个doc count<maxBufferedDocs*mergeFactor^(x+1)的segments合并，如果合并后的segment的doc count大于maxBufferedDocs*mergeFactor^(x+1)，就继续2.2步骤，使得整个队列符合上述两公式
-----

上述两种策略，最终确保了Lucene中的segments不会太多,确保效率。

BTW：实际上，如果documents太多时，Lucene还支持把docuements分成几个组，每个组用独立的进程或电脑进行索引，而后再多个目录的索引合并起来，具体可参考IndexWriter#addIndexesNoOptimize和IndexWriter#addIndexes函数

posted @ 2008-02-06 01:58 鹏飞万里阅读(4705) | 评论 (3) | 编辑收藏

备忘：lucene的一些enum类型

FieldSelectorResult:枚举，分别为
        LOAD Document#getFieldable和Document#getField不会返回null
        LAZY_LOAD :Lazy的Field意味着在搜索结果里这个Field的值缺省是不读取的，只有当你真正对这个Field取值的时候才会去取。所以如果你要对它取值，你得保证IndexReader还没有close。 Document#getField不能使用，只能使用Document#getFieldable
        NO_LOAD Document#getField和Document#getFieldable都返回null，Document#add不被调用。
        LOAD_AND_BREAK 类似LOAD，Document#getField和Document#getFieldable都可用，但返回后就结束，Document可能没有完整的field的Set，参考LoadFirstFieldSelector 。
        LOAD_FOR_MERGE 类似LOAD，但不压缩任何数据。只被SegmentMerger的一个FieldSelector匿名内嵌实现类使用。Document#getField和Document#getFieldable可返回null.
        SIZE 返回Field的size而不是value. Size表示存储这个field需要的bytes數，string数值使用2*chars。size被存储为a binary value，表现为as an int in a byte[]，with the higher order byte first in [0]。
        SIZE_AND_BREAK 类似SIZE，但立刻break from the field loading loop, i.e. stop loading further fields, after the size is loaded

======================================

Field中三大enum: Store Index和TermVector：

       ------------------------------------
        Store.COMPRESS Store the original field value in the index in a compressed form. This is useful for long documents and for binary valued fields.压缩存储；
        Store.YES Store the original field value in the index. This is useful for short texts like a document's title which should be displayed with the results. The value is stored in its original form, i.e. no analyzer is used before it is stored. 索引文件本来只存储索引数据, 此设计将原文内容直接也存储在索引文件中，如文档的标题。
        Store.NO Do not store the field value in the index. 原文不存储在索引文件中，搜索结果命中后，再根据其他附加属性如文件的Path，数据库的主键等，重新连接打开原文，适合原文内容较大的情况。
        决定了Field对象的 this.isStored 和        this.isCompressed
     ------------------------------------
        Index.NO Do not index the field value. This field can thus not be searched, but one can still access its contents provided it is Field.Store stored. 不进行索引，存放不能被搜索的内容如文档的一些附加属性如文档类型, URL等。
        Index.TOKENIZED Index the field's value so it can be searched. An Analyzer will be used to tokenize and possibly further normalize the text before its terms will be stored in the index. This is useful for common text. 分词索引
        Index.UN_TOKENIZED Index the field's value without using an Analyzer, so it can be searched. As no analyzer is used the value will be stored as a single term. This is useful for unique Ids like product numbers. 不分词进行索引，如作者名，日期等，Rod Johnson本身为一单词，不再需要分词。

        Index.NO_NORMS 不分词，建索引。norms是什么???字段值???。但是Field的值不像通常那样被保存，而是只取一个byte，这样节约存储空间???? Index the field's value without an Analyzer, and disable the storing of norms. No norms means that index-time boosting and field length normalization will be disabled. The benefit is less memory usage as norms take up one byte per indexed field for every document in the index.Note that once you index a given field <i>with</i> norms enabled, disabling norms will have no effect. In other words, for NO_NORMS to have the above described effect on a field, all instances of that field must be indexed with NO_NORMS from the beginning.
        决定了Field对象的 this.isIndexed this.isTokenized this.omitNorms
     ------------------------------------
        Lucene 1.4.3新增的：
        TermVector.NO Do not store term vectors. 不保存term vectors
        TermVector.YES Store the term vectors of each document. A term vector is a list of the document's terms and their number of occurences in that document. 保存term vectors。
        TermVector.WITH_POSITIONS Store the term vector + token position information 保存term vectors。（保存值和token位置信息）
        TermVector.WITH_OFFSETS Store the term vector + Token offset information
        TermVector.WITH_POSITIONS_OFFSETS Store the term vector + Token position and offset information 保存term vectors。（保存值和Token的offset）
        决定了Field对象的this.storeTermVector this.storePositionWithTermVector this.storeOffsetWithTermVector

posted @ 2008-01-29 14:02 鹏飞万里阅读(907) | 评论 (0) | 编辑收藏

Java7 VB2008都开始支持Lambda(Closure)了

Closure: http://en.wikipedia.org/wiki/Closure_%28computer_science%29
我还比较喜欢Microsoft的一段说明，位于链接http://msdn.microsoft.com/msdnmag/issues/07/09/BasicInstincts/Default.aspx?loc=zh中，查找“Lambda 表达式和变量提升”

最早接触Closure是在学javascript, 前年还写了篇关于Closure对javascript内存泄露的文章http://www.blogjava.net/tim-wu/archive/2006/05/29/48729.html
一直以为这就是函数式语言的特性，顶多就是.net的委托和它有几分相识，
没想到现在Java7也要支持了，有兴趣的朋友可以去读读：
http://www.javac.info/
没细读，不知道Lambda在Java这种强类型检查的语言中表现的如何。

作为函数语言，Ruby中一直都有closure的用法，http://samdanielson.com/2007/3/19/proc-new-vs-lambda-in-ruby有个简单例子。

def foo

f = Proc.new { return "return from foo from inside proc" }

f.call # control leaves foo here

return "return from foo"

end

def bar

f = lambda { return "return from lambda" }

f.call # control does not leave bar here

return "return from bar"

end

puts foo # prints "return from foo from inside proc"

puts bar # prints "return from bar"

最近，ruby 1.9又提供了新的定义lambda

x = ->{puts "Hello Lambda"}

参见 http://www.infoq.com/cn/news/2008/01/new-lambda-syntax

VB 2008也支持了，这个链接有个例子，又是Lambda，又是范型，，又是委托回调，很有意思：
http://msdn.microsoft.com/msdnmag/issues/07/09/BasicInstincts/Default.aspx?loc=zh
VB中“需要注意的一点限制是，lambda 表达式完全就是一个单个表达式。在 Visual Basic 2008 中，您在 lambda 表达式中只能有一个单个表达式。在本专栏中，我将进一步向您展示 Visual Basic 2008 中引入的一个新的三元运算符，它将允许您构造条件表达式，但目前的功能不支持在 lambda 表达式中使用任意语句。”

posted @ 2008-01-29 12:58 鹏飞万里阅读(2383) | 评论 (1) | 编辑收藏

备忘:lucene的几种常用Analyzer

以下内容均为转载,url见具体链接:

最常见的四个Analyzer,说明: http://windshowzbf.bokee.com/3016397.html
WhitespaceAnalyzer 仅仅是去除空格，对字符没有lowcase化,不支持中文
SimpleAnalyzer :功能强于WhitespaceAnalyzer,将除去letter之外的符号全部过滤掉,并且将所有的字符lowcase化,不支持中文
StopAnalyzer: StopAnalyzer的功能超越了SimpleAnalyzer，在SimpleAnalyzer的基础上.增加了去除StopWords的功能,不支持中文.类中使用一个static数组保存了ENGLISH_STOP_WORDS, 太常见不index的words
StandardAnalyzer: 用Javacc定义的一套EBNF，严禁的语法。有人说英文的处理能力同于StopAnalyzer.支持中文采用的方法为单字切分。未仔细比较，不敢确定。

其他的扩展:
ChineseAnalyzer:来自于Lucene的sand box.性能类似于StandardAnalyzer,缺点是不支持中英文混和分词.
CJKAnalyzer:chedong写的CJKAnalyzer的功能在英文处理上的功能和StandardAnalyzer相同.但是在汉语的分词上，不能过滤掉标点符号，即使用二元切分
TjuChineseAnalyzer: http://windshowzbf.bokee.com/3016397.html写的,功能最为强大.TjuChineseAnlyzer的功能相当强大,在中文分词方面由于其调用的为ICTCLAS的java接口.所以其在中文方面性能上同与ICTCLAS.其在英文分词上采用了Lucene的StopAnalyzer,可以去除 stopWords,而且可以不区分大小写,过滤掉各类标点符号.

例子:
http://www.langtech.org.cn/index.php/uid-5080-action-viewspace-itemid-68, 还有简单的代码分析

Analyzing "The quick brown fox jumped over the lazy dogs"

WhitespaceAnalyzer:

[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

SimpleAnalyzer:

[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

StopAnalyzer:

[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

StandardAnalyzer:

[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

Analyzing "XY&Z Corporation - xyz@example.com"

WhitespaceAnalyzer:

[XY&Z] [Corporation] [-] [xyz@example.com]

SimpleAnalyzer:

[xy] [z] [corporation] [xyz] [example] [com]

StopAnalyzer:

[xy] [z] [corporation] [xyz] [example] [com]

StandardAnalyzer:

[xy&z] [corporation] [xyz@example.com]

参考连接:
http://macrochen.blogdriver.com/macrochen/1167942.html
http://macrochen.blogdriver.com/macrochen/1153507.html

http://my.dmresearch.net/bbs/viewthread.php?tid=8318
http://windshowzbf.bokee.com/3016397.html

posted @ 2008-01-26 02:03 鹏飞万里阅读(1287) | 评论 (1) | 编辑收藏

2008年1月26日

导航

统计

常用链接

留言簿(4)

我参与的团队

随笔档案

搜索

最新评论

阅读排行榜

评论排行榜