Memo: the ranking algorithm in Lucene

The explanation comes from the javadoc of Similarity.java:

For the algorithm itself, refer to the javadoc; it uses the Vector Space Model (VSM) of Information Retrieval.

For a query q, the score of a document d is given by:

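$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \text{ in } q} \left( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \right)$$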

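where:

tf(t in d) is the term frequency: the number of times term t occurs in the document d currently being scored. The more often a query term occurs in a document, the higher that document scores. The default implementation in DefaultSimilarity is: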

$$\mathrm{tf}(t \text{ in } d) = \mathrm{frequency}^{1/2}$$
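As a minimal sketch of the tf formula above: the square root dampens repeated occurrences, so a term that occurs 4 times only counts twice as much as a term that occurs once. The class name `TfSketch` is made up for illustration and is not part of Lucene.

```java
// Illustrative only: mirrors tf(t in d) = frequency^(1/2) from the text.
// TfSketch is a hypothetical name, not a Lucene class.
final class TfSketch {

    /** Term-frequency factor for a term that occurs freq times in a document. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        System.out.println(tf(1)); // prints 1.0
        System.out.println(tf(4)); // prints 2.0
        System.out.println(tf(9)); // prints 3.0
    }
}
```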

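idf(t) is the inverse document frequency. It is the inverse of docFreq (the number of documents in which term t appears), so terms that occur in fewer documents contribute more to the score. The default implementation in DefaultSimilarity is: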

$$\mathrm{idf}(t) = 1 + \log\left( \frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1} \right)$$
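The idf formula above can be sketched as follows, assuming that log is the natural logarithm (Java's `Math.log`). The class name `IdfSketch` is hypothetical; the parameter names `docFreq` and `numDocs` are taken from the formula.

```java
// Illustrative only: mirrors idf(t) = 1 + log(numDocs / (docFreq + 1)).
// IdfSketch is a hypothetical name; log is assumed to be the natural logarithm.
final class IdfSketch {

    /**
     * @param docFreq number of documents that contain the term
     * @param numDocs total number of documents in the index
     */
    static float idf(int docFreq, int numDocs) {
        // Cast to double so the division is not truncated to an integer.
        return (float) (1.0 + Math.log(numDocs / (double) (docFreq + 1)));
    }
}
```

With 1000 documents, a term found in 999 of them gets idf = 1 + log(1) = 1.0, while a term found in only 9 gets 1 + log(100), roughly 5.6, so the rarer term weighs much more.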

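coord(q,d) is a score factor based on how many of the query terms are found in the document: the number of query terms that appear in the document divided by the total number of terms in the query. Typically, a document that contains more of the query's terms receives a higher score. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.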
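A minimal sketch of the coord ratio described above. The class name `CoordSketch` and the parameter names `overlap` and `maxOverlap` are assumptions chosen for illustration.

```java
// Illustrative only: coord = (query terms found in the document) / (terms in the query).
// CoordSketch is a hypothetical name, not a Lucene class.
final class CoordSketch {

    /**
     * @param overlap    number of query terms that appear in the document
     * @param maxOverlap total number of terms in the query
     */
    static float coord(int overlap, int maxOverlap) {
        // Cast to float so the division is not truncated to an integer.
        return overlap / (float) maxOverlap;
    }
}
```

For a four-term query, a document that matches three of the terms gets coord = 0.75, and one that matches all four gets 1.0.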
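queryNorm(q) is a normalizing factor used to make scores from different queries comparable. It does not affect document ranking (every ranked document is multiplied by the same value); it only attempts to make scores from different queries (or even different indexes) comparable. This is a search time factor computed by the Similarity in effect at search time. The default implementation in DefaultSimilarity is: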

$$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

Here sumOfSquaredWeights (of the query terms) is computed by the query Weight object. For example, a boolean query computes this value as:

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \text{ in } q} \left( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \right)^2$$
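The two formulas above (sumOfSquaredWeights for a boolean query, then queryNorm) can be sketched together. This is a standalone illustration, not Lucene's Weight code: the class name `QueryNormSketch` and the choice of passing idf values and term boosts as parallel arrays are assumptions.

```java
// Illustrative only: mirrors
//   sumOfSquaredWeights = q.getBoost()^2 * sum over t in q of (idf(t) * t.getBoost())^2
//   queryNorm           = 1 / sqrt(sumOfSquaredWeights)
// QueryNormSketch is a hypothetical name, not a Lucene class.
final class QueryNormSketch {

    /**
     * @param queryBoost boost of the whole query, q.getBoost()
     * @param idfs       idf(t) for each term of the query
     * @param termBoosts t.getBoost() for each term, in the same order as idfs
     */
    static float sumOfSquaredWeights(float queryBoost, float[] idfs, float[] termBoosts) {
        float sum = 0f;
        for (int i = 0; i < idfs.length; i++) {
            float weight = idfs[i] * termBoosts[i];
            sum += weight * weight;
        }
        return queryBoost * queryBoost * sum;
    }

    /** Normalizing factor shared by every document scored for the same query. */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}
```

For a two-term query with idf values 3 and 4 and all boosts equal to 1, sumOfSquaredWeights is 9 + 16 = 25 and queryNorm is 1/5 = 0.2. Because every document is multiplied by that same 0.2, the ranking order for the query is unchanged.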

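t.getBoost() is a search time boost of term t in the query q, as specified in the query text (see query syntax) or as set by the application calling setBoost(). Note that there is no direct API for accessing the boost of a single term in a multi term query. Instead, multi terms are represented in a query as multiple TermQuery objects, so the boost of a term in the query can be read by calling getBoost() on the sub-query.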

norm(t,d) encapsulates a few (indexing time) boost and length factors: ??? Does this factor depend only on the number of tokens in the field, and not on the term itself ???

Document boost - set by calling doc.setBoost() before adding the document to the index.

Field boost - set by calling field.setBoost() before adding the field to a document.

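lengthNorm(field) - computed when the document is added to the index, according to the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing. The implementation in DefaultSimilarity is (float)(1.0 / Math.sqrt(numTerms));

When a document is added to the index, all of the above factors are multiplied together. If the document has multiple fields with the same name, all of their boosts are multiplied together: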

$$\mathrm{norm}(t,d) = doc.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}()$$
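A minimal sketch of the norm product above, using the lengthNorm expression quoted from DefaultSimilarity. The class name `NormSketch` and the method signatures are assumptions for illustration. The value computed here is the one before it is stored in the index.

```java
// Illustrative only: mirrors
//   norm(t,d) = doc.getBoost() * lengthNorm(field) * product of f.getBoost()
// over all fields f in d that share the same name.
// NormSketch is a hypothetical name, not a Lucene class.
final class NormSketch {

    /** Length factor: shorter fields (fewer tokens) yield a larger value. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /**
     * @param docBoost    boost set on the document
     * @param numTerms    number of tokens in the field
     * @param fieldBoosts boost of every field instance with this name in the document
     */
    static float norm(float docBoost, int numTerms, float[] fieldBoosts) {
        float product = 1f;
        for (float boost : fieldBoosts) {
            product *= boost; // same-named fields multiply their boosts together
        }
        return docBoost * lengthNorm(numTerms) * product;
    }
}
```

For a document boost of 1, a field of 4 tokens (lengthNorm = 0.5) and two same-named field instances boosted 2 and 1.5, the result is 1 · 0.5 · 3 = 1.5.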

However, the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte is read from the index directory and decoded back to a float. This encoding/decoding loses precision: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Also notice that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.

Posted 2008-02-09 17:58 by 鹏飞万里
