tim-wu - 语源科技BlogJava

How does Ruby locate and load gem libraries?

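ruby.exe accepts a -r option that makes Ruby load the library you name before it runs your script.
1 If RubyGems is installed, the environment variable RUBYOPT is set to -rubygems. Ruby reads this as -r followed by ubygems, so it preloads ubygems.rb (note the missing leading r: the file is ubygems.rb, not rubygems.rb).
2 If you now run ruby -e "puts $:", you can see the order of the directories Ruby searches for libraries; the first entry is a directory like "..\ruby\site_ruby\1.8".
3 Ruby therefore finds ubygems.rb at ruby\site_ruby\1.8\ubygems.rb. That file contains a single statement, require 'rubygems', and this is what actually loads rubygems.rb.
4 Next, the last line of rubygems.rb, require 'rubygems/custom_require', loads custom_require.rb.
5 Finally, custom_require.rb replaces the original implementation of require(). From then on, library loading follows the gem directory conventions.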

posted @ 2007-12-21 11:48 鹏飞万里 | Reads (753) | Comments (0)

MySQL 6 starts to support 4-byte utf8; my experiments are not finished yet....

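Starting with 6.0, MySQL supports 4-byte utf8 and fully follows RFC 3629.
See: http://dev.mysql.com/doc/refman/6.0/en/mysql-nutshell.html
Unfortunately my test did not succeed. Is this still only one of the features that "are expected to be added to MySQL 6.0: "?
Also, the current MySQL worklog entry at http://forge.mysql.com/worklog/task.php?id=1213 says: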

Version:	Server-6.0
Status:	In-Documentation
Priority:	Low
Description:	Pushed to 6.0.4 on Nov 27/2007.

So the feature will probably be officially available in version 6.0.4.

Judging from the bug tracker, MySQL already supported 4-byte utf8, utf32 and utf16 as of 5.2.6: http://bugs.mysql.com/search.php?search_for=utf32&status%5B%5D=Active&severity=&limit=30&order_by=&cmd=display&phpver=&os=0&os_details=&bug_age=0&tags=&similar=&target=&defect_class=all&workaround_viability=all&impact=all&fix_risk=all&fix_effort=all
However, I do not know where to download 5.2.6. Presumably one has to check it out of the source repository and compile it. The feature is also already in the 6.0.4 alpha source repository.

＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝
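With current versions there is a temporary workaround: the useBlobToStoreUTF8OutsideBMP option provided by the 5.1.3 connector (Connector/J). It is functionally complete, but the column type must be set to blob, so its performance is naturally questionable.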

String url = "jdbc:mysql://localhost:3306/u?useUnicode=true&characterEncoding=utf8&useBlobToStoreUTF8OutsideBMP=true&utf8OutsideBmpIncludedColumnNamePattern=a";
connection = DriverManager.getConnection(url, username, password);
Statement stmt = connection.createStatement();
// Column a must be a blob column for this workaround to apply.
ResultSet rs = stmt.executeQuery("select * from t where a like '你%'");
while (rs.next()) {
    // String name = new String(rs.getBytes("a"), "UTF-8");
    String name = rs.getString("a");
    System.out.println(name);
}
stmt.close();
connection.close();

I tried
insert t values(0xF0A38D98);
select hex(a) from t
and both storing and reading the data work correctly.
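
To see why this particular value exercises the 4-byte case, here is a small Java sketch (the class name Utf8FourByteSketch is made up for illustration). It decodes the bytes F0 A3 8D 98 and prints the resulting code point, whether it lies outside the BMP, and how many Java chars and UTF-8 bytes it occupies.

```java
import java.nio.charset.StandardCharsets;

public class Utf8FourByteSketch {
    // The same bytes as the hex literal 0xF0A38D98 used in the INSERT statement.
    static final byte[] BYTES = {(byte) 0xF0, (byte) 0xA3, (byte) 0x8D, (byte) 0x98};

    static String decode() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decode();
        int codePoint = s.codePointAt(0);
        System.out.println("code point: U+" + Integer.toHexString(codePoint).toUpperCase());
        System.out.println("outside BMP: " + Character.isSupplementaryCodePoint(codePoint));
        // A supplementary code point takes two chars (a surrogate pair) in a Java String.
        System.out.println("Java chars: " + s.length());
        System.out.println("UTF-8 bytes: " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

A 3-byte-only utf8 column cannot hold such a character, which is presumably why the connector falls back to a blob column.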

posted @ 2007-12-07 11:11 鹏飞万里 | Reads (1492) | Comments (0)

Re-reading the normal forms: don't we actually follow 4NF and 5NF quite often?

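I remember that when I studied databases at school, the textbook said: "Database modeling usually goes as far as 3NF and BCNF; 4NF and 5NF are basically useless." As a result I ignored 4NF and 5NF for years.
Today, seven years after leaving school, I had some spare time and picked up the definitions of the normal forms again, and it seems my understanding of 4NF and 5NF has been wrong all along?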

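When doing business modeling, we often try to reflect the real business relationships as completely as possible. Take factories, buyers and orders as an example:

Suppose the business says an order does not belong to a particular buyer, that is, the relationships are:
factory to order is 1:n;
factory to buyer is also 1:n;
buyers and orders are unrelated.
In this case the ORM mapping would certainly just be two association tables, and isn't that exactly 4NF? (If we built only one association table with three columns, factory id, order id and buyer id, and made the three ids a composite primary key, it would satisfy BCNF but not 4NF.)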
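
A rough Java sketch of that difference, using made-up ids and a hypothetical class name (FourthNormalFormSketch): because orders and buyers are independent of each other, a single three-column table has to hold every order/buyer combination for a factory, while the two association tables store each fact once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FourthNormalFormSketch {
    // Joins {factoryId, orderId} pairs with {factoryId, buyerId} pairs on factoryId,
    // producing the rows {factoryId, orderId, buyerId} of the single three-column table.
    static List<int[]> join(List<int[]> factoryOrders, List<int[]> factoryBuyers) {
        List<int[]> rows = new ArrayList<>();
        for (int[] fo : factoryOrders) {
            for (int[] fb : factoryBuyers) {
                if (fo[0] == fb[0]) {
                    rows.add(new int[] {fo[0], fo[1], fb[1]});
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical data: factory 1 has three orders and two buyers.
        List<int[]> factoryOrders = Arrays.asList(
                new int[] {1, 100}, new int[] {1, 101}, new int[] {1, 102});
        List<int[]> factoryBuyers = Arrays.asList(
                new int[] {1, 7}, new int[] {1, 8});

        List<int[]> single = join(factoryOrders, factoryBuyers);
        System.out.println("rows in the two association tables: "
                + (factoryOrders.size() + factoryBuyers.size()));
        System.out.println("rows in the single three-column table: " + single.size());
        for (int[] row : single) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The single table grows with orders times buyers per factory, and adding one buyer means inserting one row per existing order. That redundancy is what 4NF removes.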

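And if the relationships are more complex, so that an order can be handled by several buyers and can also be shared by several factories, we would certainly build three association tables that record the pairwise relationships independently. Isn't that following 5NF?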

posted @ 2007-11-21 15:21 鹏飞万里 | Reads (1462) | Comments (0)

Google Code really is better than SourceForge!

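Today I registered a project on Google Code. Google Code is a pleasure to use and simple enough:
register at http://code.google.com/hosting/createProject and it just works, with no review process.
svn is blazingly fast; compared with it, sf.net is an ox cart.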

I have not tried the rest, so no comment; these two points are enough.

I wonder what RubyForge is like.

posted @ 2007-11-20 23:27 鹏飞万里 | Reads (383) | Comments (0)

Lucene question: when RAMFile adds content and the file is not in a directory, why doesn't sizeInBytes need to be updated?

final byte[] addBuffer(int size) {
    byte[] buffer = new byte[size];
    if (directory != null)
        synchronized (directory) { // Ensure addition of buffer and adjustment to directory size are atomic wrt directory
            buffers.add(buffer);
            directory.sizeInBytes += size;
            sizeInBytes += size;
        }
    else
        buffers.add(buffer);
    return buffer;
}

posted @ 2007-09-21 14:53 鹏飞万里 | Reads (246) | Comments (0)

Lucene question: why does updating a file's last-modified time have to wait 1 millisecond?

In class RAMDirectory:

public void touchFile(String name) throws IOException {
    RAMFile file;
    synchronized (this) {
        file = (RAMFile) fileMap.get(name);
    }
    if (file == null)
        throw new FileNotFoundException(name);

    long ts2, ts1 = System.currentTimeMillis();
    do {
        try {
            Thread.sleep(0, 1);
        } catch (InterruptedException e) {}
        ts2 = System.currentTimeMillis();
    } while (ts1 == ts2);

    file.setLastModified(ts2);
}

posted @ 2007-09-21 14:35 鹏飞万里 | Reads (285) | Comments (0)

Lucene's disk-space-saving BitVector

org.apache.lucene.util.BitVector

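This small utility class stores bit data and offers bit-level boolean reads and writes.
In that respect it is similar to java.util.BitSet.

The interesting part is that BitVector can also be persisted (saved to a file),
and to save disk space it does this: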

if (isSparse()) {
    writeDgaps(output); // sparse bit-set more efficiently saved as d-gaps.
} else {
    writeBits(output);
}

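It first checks whether only very few bits in the data are set to 1. If so, it uses the d-gaps algorithm,
which compresses the data into the following structure
..
III(IB)+

The 1st Int is a marker; the value -1 means d-gaps
The 2nd Int is the data length (number of bits)
The 3rd Int is the number of bits set to 1

Then comes a loop that stores only the bytes whose value is non-zero
The 1st value is an Int holding the offset from the previously stored byte (the comment in isSparse describes this gap as a vint)
The 2nd value is a Byte holding the byte itself (that is, 8 bits)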
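
The layout described above can be sketched in Java roughly as follows. This is not Lucene's code: the class DgapsSketch and its helpers are made up for illustration, the output goes to a plain ByteArrayOutputStream instead of an IndexOutput, and the gap is written as a variable-length int on the assumption that the vint mentioned in the isSparse comment is what is used.

```java
import java.io.ByteArrayOutputStream;

public class DgapsSketch {
    // Writes a 4-byte big-endian int.
    static void writeInt(ByteArrayOutputStream out, int v) {
        out.write(v >>> 24);
        out.write(v >>> 16);
        out.write(v >>> 8);
        out.write(v);
    }

    // Writes a variable-length int: 7 bits per byte, high bit set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Encodes as: int -1, int size in bits, int count of set bits,
    // then one (gap, byte) pair for every non-zero byte.
    static byte[] encode(byte[] bits, int sizeInBits) {
        int count = 0;
        for (byte b : bits) {
            count += Integer.bitCount(b & 0xFF);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeInt(out, -1);
        writeInt(out, sizeInBits);
        writeInt(out, count);
        int last = 0;
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] != 0) {
                writeVInt(out, i - last); // distance from the previously stored byte
                out.write(bits[i]);
                last = i;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bits = new byte[1000]; // 8000 bits, only two of them set
        bits[3] = 0x01;
        bits[500] = (byte) 0x80;
        byte[] encoded = encode(bits, bits.length * 8);
        System.out.println("raw bytes: " + bits.length + ", d-gaps bytes: " + encoded.length);
    }
}
```

With only two non-zero bytes, the encoded form is little more than the 12-byte header, which is the saving the format is after.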

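Heh, it seems that when you write an indexer you really do economize on everything.

btw, the algorithm that decides whether only very few bits are set to 1 is also quite interesting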

private boolean isSparse() {
    // note: order of comparisons below set to favor smaller values (no
    // binary range search.)
    // note: adding 4 because we start with ((int) -1) to indicate d-gaps
    // format.
    // note: we write the d-gap for the byte number, and the byte (bits[i])
    // itself, therefore
    // multiplying count by (8+8) or (8+16) or (8+24) etc.:
    // - first 8 for writing bits[i] (1 byte vs. 1 bit), and
    // - second part for writing the byte-number d-gap as vint.
    // note: factor is for read/write of byte-arrays being faster than
    // vints.
    int factor = 10;
    if (bits.length < (1 << 7))
        return factor * (4 + (8 + 8) * count()) < size();
    if (bits.length < (1 << 14))
        return factor * (4 + (8 + 16) * count()) < size();
    if (bits.length < (1 << 21))
        return factor * (4 + (8 + 24) * count()) < size();
    if (bits.length < (1 << 28))
        return factor * (4 + (8 + 32) * count()) < size();
    return factor * (4 + (8 + 40) * count()) < size();
}
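
To get a feel for the numbers, here is a standalone Java restatement of the same estimate (the class SparseCheckSketch and its parameter names are made up; byteLength stands in for bits.length, count for count() and sizeInBits for size()). For a vector of 1000 bits, the d-gaps format is chosen only while at most 5 bits are set.

```java
public class SparseCheckSketch {
    // Same estimate as in isSparse(), with the fields passed in as parameters.
    static boolean isSparse(int byteLength, int count, int sizeInBits) {
        int factor = 10;
        int gapBits; // worst-case number of bits for one vint-encoded gap
        if (byteLength < (1 << 7)) {
            gapBits = 8;
        } else if (byteLength < (1 << 14)) {
            gapBits = 16;
        } else if (byteLength < (1 << 21)) {
            gapBits = 24;
        } else if (byteLength < (1 << 28)) {
            gapBits = 32;
        } else {
            gapBits = 40;
        }
        // long arithmetic so that large counts cannot overflow
        return (long) factor * (4 + (8L + gapBits) * count) < sizeInBits;
    }

    public static void main(String[] args) {
        // 1000 bits live in 125 bytes, so one gap costs at most 8 bits.
        System.out.println(isSparse(125, 5, 1000)); // 10 * (4 + 16 * 5) = 840, which is < 1000
        System.out.println(isSparse(125, 6, 1000)); // 10 * (4 + 16 * 6) = 1000, which is not < 1000
    }
}
```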

posted @ 2007-09-21 12:04 鹏飞万里 | Reads (499) | Comments (0)

Memo: two classic papers on crawlers

http://www.cse.iitb.ac.in/~soumen/doc/www1999f/html/

http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf

posted @ 2007-09-21 11:13 鹏飞万里 | Reads (250) | Comments (0)

Google in 2000

Today I read http://infolab.stanford.edu/~backrub/google.html and found that in the year I graduated (2000) Google had already achieved this much.

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation.
This sentence is the coolest part. What a cool algorithm; I racked my brains and still cannot see how to reach that kind of efficiency.
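
For what it is worth, the "simple iterative algorithm" can be sketched in a few lines of Java. This is only a toy version under stated assumptions: the class PageRankSketch, the 4-page link graph and the damping factor 0.85 are made up for illustration, the ranks are normalized so that they sum to 1, and pages without outgoing links are not handled.

```java
import java.util.Arrays;

public class PageRankSketch {
    // links[i] lists the pages that page i links to. Every page is assumed
    // to have at least one outgoing link.
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int page = 0; page < n; page++) {
                // Each page passes its damped rank on in equal shares to its targets.
                double share = damping * rank[page] / links[page].length;
                for (int target : links[page]) {
                    next[target] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical 4-page web: 0 -> 1, 2; 1 -> 2; 2 -> 0; 3 -> 2
        int[][] links = { {1, 2}, {2}, {0}, {2} };
        double[] rank = pageRank(links, 0.85, 50);
        System.out.println(Arrays.toString(rank));
    }
}
```

Each pass touches every link exactly once, so the cost of an iteration grows with the number of links, which makes the quoted running time for 26 million pages less mysterious.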

Heh, starting from Wikipedia, I keep finding more and more interesting articles :)
http://en.wikipedia.org/wiki/HITS_algorithm

posted @ 2007-09-13 15:18 鹏飞万里 | Reads (406) | Comments (0)

Memo: the UTF-8 format

http://www.google.com/search?hl=zh-CN&q=UTF-8+0x7FF&btnG=Google+%E6%90%9C%E7%B4%A2&lr=

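In Java a char is a 16-bit UTF-16 code unit, which can be widened to an int to get its code.
When UTF-8 stores one such char, it takes 1 to 3 bytes, that is, 8 to 24 bits.
A code <= 0x7F is stored as 1 byte
A code with (code >= 0x80) && (code <= 0x7FF) is stored as 2 bytes
A code >= 0x800 is stored as 3 bytes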
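
These three ranges are easy to check with the JDK's own UTF-8 encoder. A minimal sketch (the class name Utf8LengthSketch is made up for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthSketch {
    // Number of bytes the JDK's UTF-8 encoder produces for the given string.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // U+0041, at most 0x7F
        System.out.println(utf8Length("\u00E9")); // U+00E9, between 0x80 and 0x7FF
        System.out.println(utf8Length("\u4F60")); // U+4F60, 0x800 or above
    }
}
```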

Accordingly, the code of IndexOutput.writeChars() in Lucene is

public void writeChars(String s, int start, int length) throws IOException {
    final int end = start + length;
    for (int i = start; i < end; i++) {
        final int code = (int) s.charAt(i);
        if (code >= 0x01 && code <= 0x7F)
            // 1 byte: 0xxxxxxx
            writeByte((byte) code);
        else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
            // 2 bytes: 110xxxxx 10xxxxxx (code 0 is also written this way)
            writeByte((byte) (0xC0 | (code >> 6)));
            writeByte((byte) (0x80 | (code & 0x3F)));
        } else {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            writeByte((byte) (0xE0 | (code >>> 12)));
            writeByte((byte) (0x80 | ((code >> 6) & 0x3F)));
            writeByte((byte) (0x80 | (code & 0x3F)));
        }
    }
}

posted @ 2007-09-12 17:19 鹏飞万里 | Reads (798) | Comments (3)
