Dedian  
-- Focused on search engine development

1. Getting the IP Address of a Hostname

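    // Requires java.net.InetAddress and java.net.UnknownHostException
    try {
        InetAddress addr = InetAddress.getByName("yahoo.com");
        byte[] ipAddr = addr.getAddress();

        // Convert to dot representation; mask each signed byte so it prints as 0-255
        String ipAddrStr = "";
        for (int i = 0; i < ipAddr.length; i++) {
            if (i > 0) {
                ipAddrStr += ".";
            }
            ipAddrStr += ipAddr[i] & 0xFF;
        }
    } catch (UnknownHostException e) {
        // The hostname could not be resolved
    }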
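The manual loop above can be compared with what the library already offers. The following is a minimal sketch, not part of the original post: the class name DottedQuad, the helper names, and the sample address are made up for illustration. It assumes an IPv4 address and uses InetAddress.getByAddress, which builds the object from raw bytes without a DNS lookup, so it runs offline.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class DottedQuad {
    /** Manual conversion: mask each signed byte with 0xFF so it reads as 0..255. */
    static String toDotted(byte[] ip) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ip.length; i++) {
            if (i > 0) {
                sb.append('.');
            }
            sb.append(ip[i] & 0xFF);
        }
        return sb.toString();
    }

    /** Same result via the library; getByAddress does not perform a DNS lookup. */
    static String viaInetAddress(byte[] ip) {
        try {
            return InetAddress.getByAddress(ip).getHostAddress();
        } catch (UnknownHostException e) {
            // Thrown when the array length is not 4 (IPv4) or 16 (IPv6)
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 192, (byte) 168, 0, 1}; // sample address, chosen arbitrarily
        System.out.println(toDotted(raw));
        System.out.println(viaInetAddress(raw));
    }
}
```

Without the `& 0xFF` mask, the byte for 192 would be appended as a negative number, which is why the original loop includes it.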


2. Getting the Hostname of an IP Address

This example attempts to retrieve the hostname for an IP address. Note that getHostName() may not succeed, in which case it simply returns the IP address.

    try {
        // Get hostname by textual representation of IP address
        InetAddress addr = InetAddress.getByName("127.0.0.1");

        // Get hostname by a byte array containing the IP address
        byte[] ipAddr = new byte[]{127, 0, 0, 1};
        addr = InetAddress.getByAddress(ipAddr);

        // Get the host name
        String hostname = addr.getHostName();

        // Get canonical host name
        String hostnameCanonical = addr.getCanonicalHostName();
    } catch (UnknownHostException e) {
        // The address was malformed
    }

3. Getting the IP Address and Hostname of the Local Machine

    try {
        InetAddress addr = InetAddress.getLocalHost();

        // Get IP Address
        byte[] ipAddr = addr.getAddress();

        // Get hostname
        String hostname = addr.getHostName();
    } catch (UnknownHostException e) {
        // The local host name could not be resolved
    }

posted @ 2006-08-18 06:53 Dedian | Views (547) | Comments (0)
 
http://forums.seochat.com/alexa-ranking-49/how-does-alexa-work-140.html
posted @ 2006-08-16 07:24 Dedian | Views (301) | Comments (1)
 
In the last digest, about the greatest software ever written, I noted a worm named Morris, which the author ranks 12th. After finishing my clustered search engine, which is based on Lucene, I have been studying P2P architectures for my distributed search engine (more precisely, for its web crawler part). While reading papers on P2P lookup protocols such as Chord, I noticed that one of the developers is also named Morris. It turns out to be the same Morris: according to Wikipedia, he is now an associate professor at MIT, and he was indicted over the damage caused by his Morris worm. Anyway, it is interesting to learn some of the stories behind these geeks.
posted @ 2006-08-15 05:53 Dedian | Views (435) | Comments (0)
 
http://www.informationweek.com/shared/printableArticle.jhtml?articleID=191901844

12. The Morris worm
11. Google search rank
10. Apollo guidance system
9. Excel spreadsheet
8. Macintosh OS
7. Sabre system
6. Mosaic browser
5. Java language
4. IBM System 360 OS
3. Gene-sequencing software at the Institute for Genomic Research
2. IBM's System R
1. Unix System III



What do you think?
posted @ 2006-08-15 02:22 Dedian | Views (331) | Comments (0)
 
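Interested readers can refer to the original article.

Below is my rough translation:
------------------------------------------------------------

As everyone knows, Google runs on a large number of GNU/Linux servers, and that is only one aspect of its support for free software. There is also the Summer of Code, which has become an incubator for a good deal of excellent code and many projects, and the recently opened code repository, which looks set to displace sourceforge.net (translator's note: a major home of open-source projects). On the one hand, Google has released Picasa (translator's note: a photo management program) for GNU/Linux, which relies on Wine (translator's note: a compatibility layer for running Windows programs on Linux/Unix, on top of the X Window System); on the other hand, Google also sponsors open-source efforts, such as one in Sri Lanka, to the tune of about $25,000.

Of course, Google also funds open source more quietly. For example, to everyone's surprise, the Mozilla Foundation (translator's note: the organization behind the familiar Firefox browser) earned some 72 million last year, simply by making Google the default search engine in Firefox.

In January 2005, Google hired Ben Goodger, the lead engineer of Firefox and one of a handful of key open-source coders. At the end of that year, Guido van Rossum, the creator of Python, also joined Google. Most recently, Andrew Morton, the maintainer of the Linux 2.6 kernel, announced that he is leaving OSDL to join Google.

All of this signals a major shift in the open-source world.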

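In the early years, people wrote their code in their spare time, out of personal interest, learning as they went. Then the first dot-com era arrived, and a number of early open-source companies began hiring top programmers: kernel hackers such as Alan Cox, David Miller, and Stephen Tweedie went to Red Hat, and others went to Linuxcare.

When the first dot-com bubble burst, these experts were forced to look for new jobs, and many of them went to the up-and-coming OSDL. Against this background, Google's rise and its aggressive hiring mark a return to the earlier model of companies gathering talent. This time, of course, their work is indirectly tied to Google's main market strategy.

Google's strategy is shrewd. Look at the recent hires, Goodger and Morton: one works on the browser, the other on the operating system. Both show its determination to compete quietly with Microsoft.

There is another, perhaps less obvious, reason: the recent debate over whether Google is living up to its original promises to the open-source community. The criticism centers on whether Google should open its own source code, given how much open-source software it uses.

So, from a certain point of view, bringing open-source hackers on board is far better than releasing code everywhere.

The debate over whether companies that use open-source code should also open their own code is not limited to Google. Other major beneficiaries include Yahoo, which has recently been busy acquiring Web 2.0 companies such as Flickr and Del.icio.us, both of which clearly bear the mark of open source. Yahoo's relationship with open source does not go back as far as Google's, but Yahoo has also started to attract open-source talent.
posted @ 2006-08-11 06:39 Dedian | Views (899) | Comments (0)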
 
People are still talking about Web 2.0, but I am not sure it is a purely technical term. As I understand it, most of the meaning of Web 2.0 lies in marketing: the web is becoming a commons, and people generate its content. I am also not sure where web services fit into Web 2.0. In my understanding, the web is no longer merely a client-server marketing model (I am not talking about web structure here) but an interactive community. The question is who will be the operator or administrator of this community, and whether there are rules of the game to follow. Will it be another utopia?

Well, on the technical side, I'd like to shed some light on the so-called web standards trends:

1. front end --
         CSS ----> layout
         XML ----> data 
         XHTML ----> markup
         Javascript & DOM ----> behavior + XMLHttpRequest --> AJAX ?

2. back end -- 
         some open source projects such as Ruby on Rails...

Let me know what you think...

posted @ 2006-08-09 09:21 Dedian | Views (801) | Comments (0)
 
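As the creator of Lucene and Nutch, two major Apache open-source projects (along with related subprojects such as Lucy, Lucene4C, and Hadoop), Doug Cutting has long been followed closely by search engine developers. After working for Yahoo as a contractor for four years, he formally joined Yahoo as an employee this year.

Below is my translation, done in my spare time, of an interview he gave two years ago. The original (Doug Cutting Interview) is easy to find with a Google search. I hope it serves as a useful starting point for beginners in search engine development.

(Note: my translation skills are limited; I aim for faithfulness and clarity rather than elegance. Thank you for your understanding.)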

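1. What do you do for a living? How did you get started in search engine development?

I mainly work from home on two search-related open-source projects: Lucene and Nutch. The money comes mostly from contracts related to these projects. Yahoo! Labs currently sponsors part of the work on Nutch, and there are some other short-term contracts for both projects.

2. Can you give us a rough overview of Nutch? And where do you plan to use it?

Let me start with Lucene. Lucene is a library that provides full-text search; it is not an application. It offers an API that you can use in all kinds of real applications. It is now an Apache project and is widely used. Here is a list of some systems that already use Lucene.

Nutch is an implementation of web search built on top of the Lucene core, and it is a real application: you can download it and use it directly. On top of Lucene it adds a web crawler and some web-specific pieces. The goal is to scale from simple site indexing and search up to searching the whole web, like Google and Yahoo. Of course, to compete with those giants you have to use your head and come up with some tricks. We have tested it on 100M pages, and its design should have no problem with more than 1B pages. It also runs well on a single machine, searching a handful of servers.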

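3. In your view, what are the core elements of a search engine? That is, into which main parts or modules can a typical search engine be divided?

Let me think. Roughly the following:

 -- Fetching: downloading the pages that links point to.
 -- Database: storing information about the fetched pages, such as which pages have been fetched, when they were fetched, and which pages they link to.
 -- Link analysis: analyzing the information in that database and assigning each page a weight (PageRank, WebRank, and so on) to estimate its importance. In my view, though, indexing the text inside the anchors is more important. (That is also why things like Google bombing are so effective.)
 -- Indexing: indexing the content of the fetched pages, together with incoming links, link-analysis weights, and so on, so that they can be queried quickly.
 -- Searching: running a query against an index and displaying the results in ranked order.

Of course, for a search engine to handle hundreds of millions of pages, all of these modules should be distributed; that is, they should be able to run in parallel on many machines.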
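To make the indexing and searching steps in the list above concrete, here is a toy sketch of an in-memory inverted index. It is not Lucene's or Nutch's API; the class TinyIndex and its methods are hypothetical, tokenization is a naive split on non-word characters, and ranking is left out entirely.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TinyIndex {
    // term -> ids of the documents that contain it (the "postings")
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    /** Indexing: record every term of the text under the given document id. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    /** Searching: return the ids of documents that contain all query terms. */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                return Collections.<Integer>emptySet(); // one term is missing everywhere
            }
            if (result == null) {
                result = new TreeSet<Integer>(docs);
            } else {
                result.retainAll(docs);                 // intersect postings lists
            }
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }
}
```

A real engine would add the link-analysis weights and anchor text mentioned above at indexing time and use them to order the results.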

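4. You said that anyone can download Nutch right away and run it on their own machine. Does that mean that even webmasters who have no control over their Apache server can start using Nutch in a short time?

Unfortunately, most of them are probably out of luck, because Nutch needs a Java servlet container (translator's note: such as Tomcat). Some ISPs support that, but most do not. (Translator's note: only if you control the Apache server can you install something like Tomcat alongside it.)

5. Can I combine Lucene with the Google Web API? Or with other applications I have written before?

A few people have written Google-like APIs for Nutch, but none of them has been merged into the current system yet. That will probably happen in the near future.

6. What do you think is the biggest obstacle to implementing a search engine today? Is it hardware, storage, or ranking algorithms? Also, can you tell me roughly how much space a search engine needs in order to work properly, say if I only want to write a search engine for thousands to millions of RSS feeds?

Nutch needs roughly 10 KB of space per page in total. RSS feed pages are generally fairly small (translator's note: RSS feeds are XML-based text pages, so they are not large), so they should be easier to handle. Nutch does not currently have RSS support, though. (Translator's note: in fact, the API does contain data structures and parsing for RSS.)

7. Was it easy to get funding from Yahoo! Labs? Who can apply? And what do you have to do in return?

I was invited; I did not apply. So I am not very familiar with the process.

8. Has Google shown any interest in Nutch?

I have talked with some of the folks there, including Larry Page (translator's note: one of Google's two founders). They would all like to help, but they have not been able to find a way to do so that would not also help their competitors.

9. Have you implemented your own PageRank or WebRank algorithm in Nutch? What are your considerations for page ranking?

Yes, Nutch has a link analysis module. It is optional, because page ranking is not needed for site search.

10. I guess you have heard this before: doesn't an open-source search engine also give search engine optimization (SEO) hackers an opening?

Well, possibly.
Say it takes about six months to reverse-engineer the latest anti-spam detection algorithm of a closed-source search engine. For an open-source search engine, that will happen faster. But either way, spammers will eventually find a way around it; the only difference is how quickly. So the best anti-spam techniques, open source or closed source, are those that keep working even after their mechanism is known.

Also, if during those six months you remove the detected spam from your index, the spammers can do nothing about it except change their sites. If your spam detection is based on statistical analysis of good and bad example sites, you can watch overnight for new spam patterns and remove them before the spammers have a chance to react.

Open source makes the job of stopping spam somewhat harder, but not impossible. Besides, the closed-source search engines have not secretly solved these problems either. I think the advantage of closed source is that it keeps us from seeing that it is not as good as we imagine.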

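11. How does Nutch compare with the distributed web crawler Grub? What are your thoughts on this?

As far as I can tell, Grub is a project that lets people donate some of their hardware and bandwidth to LookSmart's huge crawling effort. Only the client is open source; the server is not. So people cannot set up their own Grub service, nor can they access the data Grub collects.

What about distributed crawling more generally? When a search engine gets large, the cost of crawling is dwarfed by the cost of searching. So a distributed crawler does not significantly reduce costs; instead, it complicates something that is already not very expensive (translator's note: meaning hardware such as PCs and disks). So it is not a good trade.

Widely distributed search is an interesting idea, but I am not sure it can be made to work and still be fast enough. A faster search engine is a better search engine. When people can quickly revise their queries, they more often find what they are looking for before they lose patience. But building a widely distributed search engine that can search hundreds of millions of pages in under a second is hard, because network latency is high. Most of the half second or so that Google takes to show its results is network latency within a single data center. If you ran the same system on home PCs in thousands of households, even with DSL and cable connections, the latency would be much higher, so a query would likely take several seconds or even longer. It would therefore not be a good search engine.

12. You keep stressing how important speed is for a search engine. I often wonder how Google manages to return query results so quickly. How do you think they do it? And what is your experience with Nutch?

I believe Google works in much the same way as Nutch: the query is broadcast to a number of nodes, and each node returns the top results for its pages. Each node holds a few million pages, which avoids disk access for most queries, and each node can handle tens to hundreds of queries per second. If you want to cover hundreds of millions of pages, you broadcast the query to thousands of nodes. Of course, that generates a fair amount of network traffic.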
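The broadcast-and-merge idea described in this answer can be sketched as follows. This is an illustration under stated assumptions, not Nutch's or Google's actual code: the Node interface, the Hit class, and the scores are hypothetical, and the nodes are queried in a simple loop where a real system would contact them in parallel over the network.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ScatterGather {
    /** One search result: a page and its relevance score (hypothetical type). */
    static final class Hit {
        final String url;
        final double score;

        Hit(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Hypothetical node: holds one slice of the index and returns its k best hits. */
    interface Node {
        List<Hit> topK(String query, int k);
    }

    /** Broadcast the query to every node, then merge the partial results into a global top k. */
    static List<Hit> search(List<Node> nodes, String query, int k) {
        List<Hit> all = new ArrayList<Hit>();
        for (Node node : nodes) {
            all.addAll(node.topK(query, k)); // each node contributes at most k candidates
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<Hit>(all.subList(0, Math.min(k, all.size())));
    }
}
```

Because every node returns only its own top k, the merge step handles at most k times the number of nodes candidates, which keeps the traffic per query bounded even with thousands of nodes.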

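The details are described in this article (www.computer.org/micro/mi2003/m2022.pdf).

13. You mentioned spam just now. Does Nutch have algorithms of this kind? How do you distinguish spam patterns such as link farms (translator's note: groups of pages that all link to one another, a spamdexing method devised by SEO practitioners in 1999 to boost page rankings in the Inktomi search engine) from links to legitimately popular sites?

We have not yet found the time to work on this. It is clearly an important area, though. Before we get to link farms, we need to do some simpler things: look for word stuffing (translator's note: embedding particular words in a page many times, even hundreds of times, sometimes invisibly to the human eye, for example with white text on a white background; this is another spamdexing method), white-on-white text, and so on.

I think that, in general (spam detection being one sub-problem), the key to search quality is having a supporting process for reliable manual evaluation of query results. With that, we can train a ranking algorithm to produce better results (spam results are one kind of bad result). Commercial search engines usually hire people to do reliable evaluations. Nutch will do this too, but obviously we cannot simply accept evaluations from whoever volunteers, because spammers could easily game those evaluations. So we need a way to build trust in volunteer evaluators. I think a peer-review system, somewhat like Slashdot's karma system, should help a lot here.

14. Where do you think search engines are headed in the near future? From a developer's point of view, where will the biggest obstacles be?

Sorry, I am not a very imaginative person. My prediction is that ten years from now web search engines will look much like today's. We are in a plateau. In the first few years, web search engines did develop very rapidly. The web crawlers that appeared in 1994 used standard information retrieval methods. From then until Google appeared in 1998, more web-specific methods were developed. Since then, the introduction of new methods has slowed considerably. The low-hanging fruit has been picked. Innovation is easy only when a field is new; the more mature it becomes, the harder it is to innovate. Web search originated in the 1990s, has by now become a cash cow, and will soon be part of people's everyday lives.

As for development challenges, I think operational reliability will be a big one. We are currently developing something like GFS (the Google File System). It is an indispensable foundation for a very large search engine: you cannot let the failure of one small component bring down the whole system. You should be able to scale the system easily, just by adding more hardware to the pool, without tedious reconfiguration. And you should not need a large operations staff; for the most part everything should take care of itself.

----------------End----------------------
posted @ 2006-08-02 06:07 Dedian | Views (14352) | Comments (199)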
 

--  Getting Ready to Use CVS

First set the variable CVSROOT to /class/`username`/cvsroot
[Or any other directory you wish]
[For csh/tcsh: setenv CVSROOT ~/cvsroot]
[For bash/ksh: CVSROOT=~/cvsroot;export CVSROOT]

Next run cvsinit. It will create this directory along with the subdirectory CVSROOT and put several files into CVSROOT.

-- How to put a project under CVS

A simple program consisting of multiple files is in /workspaces/project.

To put this program under cvs first

cd to /workspaces/project

Next

cvs import -m "Sample Program" project sample start

CVS should respond with
N project/Makefile
N project/main.c
N project/bar.c
N project/foo.c

No conflicts created by this import


If you were importing your own program, you could now delete the original source.
(Of course, keeping a backup is always a good idea)

-- Basic CVS Usage

Now that you have added 'project' to your CVS repository, you will want to be able to modify the code.

To do this you want to check out the source. You will want to cd to your home directory before you do this.

cd

cvs checkout project

CVS should respond with
cvs checkout: Updating project
U project/Makefile
U project/bar.c
U project/foo.c
U project/main.c



This creates the project directory in your home directory and puts the files: Makefile, bar.c, foo.c, and main.c into the directory along with a CVS directory which stores some information about the files.

You can now make changes to any of the files in the source tree.
Let's say you add a printf("DONE\n"); after the function call to bar()
[Or just cp /class/bfennema/project_other/main2.c to main.c]

Now you have to check in the new copy

cvs commit -m "Added a DONE message." main.c

CVS should respond with
Checking in main.c;
/class/'username'/cvsroot/project/main.c,v <-- main.c
new revision: 1.2; previous revision: 1.1
done


Note that the -m option lets you give the check-in (log) message on the command line. If you omit it, you will be placed in an editor where you can type the check-in message.

-- Using CVS with Multiple Developers

To simulate multiple developers, first create a directory for your second developer.
Call it devel2 (Create it in your home directory).
Next check out another copy of project.
  • HINT: cvs checkout project
Next, in the devel2/project directory, add a printf("YOU\n"); after the printf("BAR\n");
[Or copy /class/bfennema/project_other/bar2.c to bar.c]

Next, check in bar.c as developer two.
  • HINT: cvs commit -m "Added a YOU" bar.c
Now, go back to the original developer directory.
[Probably /class/'username'/project]

Now look at bar.c. As you can see, the change made by developer two has not been integrated into your version. For that to happen you must

cvs update bar.c

CVS should respond with
U bar.c

Now look at bar.c. It should now be the same as developer two's.
Next, edit foo.c as the original developer and add printf("YOU\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo2.c to foo.c]

Then check in foo.c

  • HINT: cvs commit -m "Added YOU" foo.c
Next, cd back to developer two's directory.
Add printf("TOO\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo3.c to foo.c]

Now type

cvs status foo.c

CVS should respond with
===================================================================
File: foo.c             Status: Needs Merge

   Working revision:    1.1.1.1 'Some Date'
   Repository revision: 1.2     /class/'username'/cvsroot/project/foo.c,v
   Sticky Tag:          (none)
   Sticky Date:         (none)
   Sticky Options:      (none)

The possible statuses of a file are:
Up-to-date
    The file is identical with the latest revision in the repository.
Locally Modified
    You have edited the file, and not yet committed your changes.
Needs Patch
    Someone else has committed a newer revision to the repository.
Needs Merge
    Someone else has committed a newer revision to the repository, and you have also made modifications to the file.

Therefore, this is telling us that we need to merge our changes with the changes made by developer one. To do this

cvs update foo.c

CVS should respond with
RCS file: /class/'username'/cvsroot/project/foo.c,v
retrieving revision 1.1.1.1
retrieving revision 1.2
Merging differences between 1.1.1.1 and 1.2 into foo.c
rcsmerge: warning: conflicts during merge
cvs update: conflicts found in foo.c
C foo.c


Since the changes we made to each version were so close together, we must manually adjust foo.c to look the way we want it to look. Looking at foo.c we see:
void foo()
{
  printf("FOO\n");
<<<<<<< foo.c
  printf("TOO\n");
=======
  printf("YOU\n");
>>>>>>> 1.2
}

We see that the text we added as developer one is between the ======= and the >>>>>>> 1.2.
The text we just added is between the ======= and the <<<<<<< foo.c

To fix this, move the printf("TOO\n"); line to after the printf("YOU\n"); line and delete the additional lines that CVS inserted. [Or copy /class/bfennema/project_other/foo4.c to foo.c]
Next, commit foo.c

cvs commit -m "Added TOO" foo.c

Since you issued a cvs update command and integrated the changes made by developer one, the integrated changes are committed to the source tree.

-- Additional CVS Commands

To add a new file to a module:
  • Get a working copy of the module.
  • Create the new file inside your working copy.
  • Use cvs add filename to tell CVS to version control the file.
  • Use cvs commit filename to check in the file to the repository.

To remove a file from a module:
  • Make sure you haven't made any uncommitted modifications to the file.
  • Remove the file from the working copy of the module: rm filename.
  • Use cvs remove filename to tell CVS you want to delete the file.
  • Use cvs commit filename to actually perform the removal from the repository.

For more information see the cvs man pages or the cvs.ps file in cvs-1.7/doc.

---------------
copied from http://www.csc.calpoly.edu/~dbutler/tutorials/winter96/cvs/
posted @ 2006-07-20 07:06 Dedian | Views (497) | Comments (0)
 
reference:

http://java.sun.com/j2se/1.4.2/docs/guide/util/logging/overview.html
posted @ 2006-06-27 02:49 Dedian | Views (268) | Comments (0)
 

When reading the GData source code, you will find a lot of generic-style code in it; generics are one of several language extensions in JDK 1.5. If you are using a Java 1.5 compiler, it is well worth getting familiar with generics. Note that Java generics look like C++ templates but are quite different.

1. What is the idea of generics?
Simply put, generics let you parameterize a type, such as a class or an interface, by other types.

2. Examples?
-- We are familiar with container types such as Collection. Here is the typical usage from Java 1.4 and earlier:
Vector myList = new Vector();
myList.add(new Integer(100));
Integer value = (Integer)myList.get(0);

Now it is better to write it like this for type safety (the Eclipse IDE displays type safety warnings for the code above when the Java 1.5 compiler option is selected):
  Vector<Integer> myList = new Vector<Integer>();
  myList.add(new Integer(100));
  Integer value = myList.get(0);

-- We can write code like this because class Vector is defined as a generic:
public class Vector<E>
{
      public boolean add(E e) { ... }
      ......
}

-- When we see angle brackets in a declaration, it is a generic. E here is the formal type parameter; a use such as Vector<Integer> is an invocation, that is, a parameterized type. To use the generic, we need to specify an actual type argument (such as Integer above).

3. A pitfall with generics

-- We know that generics make types such as containers both safer and still flexible in what they accept. But there are pitfalls. Taking a container as the example, one question is whether we can assign one container to a container of a more general element type. If you want to do it in the following style, the answer is no.
List<String> ls = new ArrayList<String>();
List<Object> lo = ls; //compile time error!

-- We know that String is a subtype of Object, and we can assign a String value to an Object variable. But we cannot assign a List of String to a List of Object as a whole. The reason is that lo would be an alias for ls: through lo we could add an arbitrary Object to the list, and ls would then no longer contain only Strings. That would make the List type unsafe, so the Java 1.5 compiler will not let you do it.
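A small sketch of the difference between aliasing and copying may help here. The class InvarianceDemo and the helper copyAsObjects are made up for illustration; the commented-out lines show what the compiler rejects and what would go wrong if it did not.

```java
import java.util.ArrayList;
import java.util.List;

final class InvarianceDemo {
    /** Copying the elements into a new List<Object> is allowed; aliasing the same list is not. */
    static List<Object> copyAsObjects(List<String> strings) {
        return new ArrayList<Object>(strings);
    }

    public static void main(String[] args) {
        List<String> ls = new ArrayList<String>();
        ls.add("hello");

        // List<Object> lo = ls;         // rejected at compile time
        // lo.add(Integer.valueOf(42));  // if it were allowed, an Integer would end up in ls
        // String s = ls.get(1);         // and this read would fail with ClassCastException

        List<Object> copy = copyAsObjects(ls);
        copy.add(Integer.valueOf(42));   // fine: copy is a separate list
        System.out.println(ls.size() + " " + copy.size());
    }
}
```

So values can be copied from a List of String into a List of Object element by element; only treating the very same list object as both types is forbidden.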

-- Comparing the two styles of code in the examples of section 2, we might say that the older style looks more flexible, because myList can accept types other than Integer, while the new 1.5 style accepts only Integer values. If we need more flexibility, we can use wildcards.

4. Wildcards and bounded wildcards

-- If we see something like Collection<?> c, with a question mark in the angle brackets, that is a wildcard: the element type is unknown, so the declaration matches a Collection of any type.
-- If we see something like Collection<? extends Number> c, that is a bounded wildcard: the unknown element type has an upper bound, so it must be Number or a subtype of Number.
-- But with either a wildcard or a bounded wildcard of this kind, we cannot put a value of a specific type into the collection (only null), because the element type is unknown and the compiler cannot check that the value fits.
-- So what can wildcards be used for? Go back to the flexibility mentioned earlier: wildcards express flexibility in a definition or declaration, typically for code that only reads from the collection.
For example, we can define a method like this:
void printCollection(Collection<?> c)
{
      for(Object e : c){System.out.println(e);}
}
See? That is flexible. You can call this method with any Collection. You can read the elements of a Collection<?>; just don't try to put anything into it.
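The bounded wildcard from the list above works the same way for reading. A minimal sketch (the class WildcardSum and the method sum are illustrative names, not from the tutorial):

```java
import java.util.Collection;

final class WildcardSum {
    /** Accepts a Collection of Integer, Double, or any other Number subtype. */
    static double sum(Collection<? extends Number> numbers) {
        double total = 0;
        for (Number n : numbers) {   // reading is safe: every element is at least a Number
            total += n.doubleValue();
        }
        // numbers.add(Integer.valueOf(1)); // does not compile: the exact element type is unknown
        return total;
    }
}
```

The same method can be called with a List of Integer or a List of Double, which a parameter of type Collection<Number> would not allow.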
-- So the question is: what if we want that flexibility for our method and also need to put something into the collection inside the method? Then we need a generic method.

5. Generic methods
-- This means that a method declaration can also be parameterized by type.
-- Example:
    public <T> void addCollection(List<T> objs, T obj)
    {
        objs.add(obj);
    }
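A short usage sketch of such a generic method follows. The wrapper class GenericMethodDemo and the sample values are made up, and the method is declared static here only so that it can be called from main.

```java
import java.util.ArrayList;
import java.util.List;

final class GenericMethodDemo {
    static <T> void addCollection(List<T> objs, T obj) {
        objs.add(obj);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        addCollection(names, "Lucene");                // T is inferred as String

        List<Number> numbers = new ArrayList<Number>();
        addCollection(numbers, Integer.valueOf(7));    // T is Number; an Integer is a Number

        // addCollection(names, Integer.valueOf(7));   // does not compile: Integer is not a String
        System.out.println(names + " " + numbers);
    }
}
```

Unlike a wildcard parameter, the type parameter T ties the list and the new element together, so the compiler can verify that the element fits before allowing the add.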

6. When to use a generic method and when to use a wildcard?
-- If a type parameter is used only once, or has no relationship to the method's other arguments or its return type, a wildcard expresses the intent more clearly and concisely.
-- Otherwise, a generic method should be used.
Example:
class Collections
{
      public static <T, S extends T> void copy(List<T> dest, List<S> src){...}
}
can be better rewritten as:
class Collections
{
      public static <T> void copy(List<T> dest, List<? extends T> src){...}
}
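Here is a runnable sketch of the wildcard version of copy. The class CopyDemo and its body are illustrative; the real java.util.Collections.copy declares its destination as List<? super T>, which is slightly more general than the signature shown above.

```java
import java.util.List;

final class CopyDemo {
    /** Copies src into the front of dest; dest must already be at least as long as src. */
    static <T> void copy(List<T> dest, List<? extends T> src) {
        for (int i = 0; i < src.size(); i++) {
            dest.set(i, src.get(i)); // each element of src is a T (or a subtype), so it fits dest
        }
    }
}
```

With this signature a List of Integer can be copied into a List of Number: T is inferred as Number, and the source list matches List<? extends Number>.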

reference: http://java.sun.com/j2se/1.5/pdf/generics-tutorial.pdf

posted @ 2006-06-23 09:39 Dedian | Views (1381) | Comments (0)
 
http://dsonline.computer.org/portal/site/dsonline/menuitem.9ed3d9924aeb0dcd82ccc6716bbe36ec/index.jsp?&pName=dso_level1&path=dsonline/0507&file=w4sta.xml&xsl=article.xsl&;jsessionid=GZQWvln9z4JY2dXX8HyQ5f5KtRptqHRWvh17tjCXVbxHnGyzvTm2!554406865
posted @ 2006-06-22 06:06 Dedian | Views (205) | Comments (0)
 

http://java.sun.com/j2se/1.5.0/docs/guide/language/index.html

posted @ 2006-06-21 09:51 Dedian | Views (195) | Comments (0)
 
While debugging my web crawler by crawling the Yahoo website, I found that connecting to a URL such as http://www.youtube.com/w/Kak%E1?v=PIBe_V9PBIA&search=kak%C3%A1 throws the following exception:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 12
 at java.lang.String.substring(Unknown Source)
 at sun.net.www.ParseUtil.unescape(Unknown Source)
 at sun.net.www.ParseUtil.decode(Unknown Source)
 at sun.net.www.ParseUtil.toURI(Unknown Source)
 at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
 at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)

Here is simple test code:

    // Requires java.net.URL and java.net.URLConnection
    private static final String urlstring = "http://www.youtube.com/w/Kak%E1?v=PIBe_V9PBIA&search=kak%C3%A1";

    URL url = new URL(urlstring);
    URLConnection con = url.openConnection();
    con.connect();

Since MalformedURLException and IOException are the only exceptions this code is required to catch, I am not sure whether this is a bug in Java's URL parsing...

Does anybody have an idea about this?

P.S. OK, somebody has pointed out that runtime exceptions such as java.lang.StringIndexOutOfBoundsException do not have to be declared but can still be thrown, so my code needs to catch StringIndexOutOfBoundsException. But in my understanding, a function should catch the exceptions coming from the lower-level functions it calls and rethrow the ones it cannot handle, so that callers can catch exceptions that originate deep inside. I am not sure runtime exceptions should be an exception to that...
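One way a crawler could avoid handing such URLs to the connection code is to pre-check the percent-escapes. The sketch below rests on a guess that is not confirmed in the post: that the failure is triggered by %E1, which is a well-formed escape but not a complete UTF-8 sequence. The class EscapeCheck and its method are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EscapeCheck {
    /** True if every %XX escape is well formed and the decoded bytes form valid UTF-8. */
    static boolean hasValidUtf8Escapes(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                if (i + 2 >= s.length()) {
                    return false;                       // truncated escape such as "%4"
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    return false;                       // non-hex escape such as "%zz"
                }
                bytes.write((hi << 4) | lo);
                i += 2;
            } else {
                byte[] raw = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
                bytes.write(raw, 0, raw.length);        // unescaped characters pass through
            }
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes.toByteArray()));
            return true;
        } catch (CharacterCodingException e) {
            return false;                               // e.g. a lone 0xE1 lead byte
        }
    }
}
```

A crawler could skip or log URLs that fail this check, and still wrap the connect call in a catch for RuntimeException as the P.S. suggests, since the check may not cover every input that trips the parser.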
posted @ 2006-06-15 07:48 Dedian | Views (498) | Comments (0)
 
Still working on the web crawler part; I am thinking through the URL collection strategies. A URL frontier, which stores the list of active URLs to be downloaded or parsed, will handle synchronized I/O operations with the URL collection/inventory. I am stuck on some issues:

1. Duplicate URL elimination:
    a. Host name aliases --> DNS resolver
    b. Omitted port numbers
    c. Alternative paths on the same host
    d. Replication across different hosts
    e. Nonsense links or session IDs embedded in URLs?
2. Reachability of URLs
3. Distributed storage of the URL inventory and the related synchronization problem
4. Fetch strategies for the URL frontier or fetcher to get active links for parsing
5. Scheduler for fetching and updating the URL collection: multi-threaded or single-threaded on each PC, and when to decide to re-parse a page
7. URL-seen test: has the page already been parsed, and should it be re-parsed? This should be done before a URL enters the URL frontier...
8. Extensibility of the modules: fetcher, extractor/filters, collector...
9. Checkpointing for interrupted crawls: how to resume a crawl job, and how to split crawl jobs and distribute them to different machines
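Part of the duplicate elimination and the URL-seen test from the list above can be sketched with java.net.URI. This is one possible approach, not the design the post settled on: the class UrlSeenTest and its methods are hypothetical, and the sketch covers only case differences, default ports, "." and ".." path segments, and fragments. Host aliases, mirrored hosts, and session IDs would need extra logic such as DNS resolution or content fingerprints.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class UrlSeenTest {
    private final Set<String> seen = new HashSet<String>();

    /** Canonical form: lower-case scheme and host, default port dropped, path normalized, fragment removed. */
    static String canonicalize(String url) {
        URI u = URI.create(url).normalize();            // resolves "." and ".." path segments
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("not an absolute URL: " + url);
        }
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        int port = u.getPort();
        if ((scheme.equals("http") && port == 80) || (scheme.equals("https") && port == 443)) {
            port = -1;                                  // explicit default port equals an omitted port
        }
        String rawPath = u.getRawPath();
        String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (port != -1) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        return sb.toString();
    }

    /** Returns true the first time a URL (in canonical form) is offered, false afterwards. */
    boolean firstVisit(String url) {
        return seen.add(canonicalize(url));
    }
}
```

Running firstVisit before a URL enters the frontier keeps equivalent spellings of the same address from being fetched twice. For a distributed inventory, the in-memory set would have to be replaced by shared or partitioned storage.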

It seems that I need a couple of days to refine my system architecture design...
posted @ 2006-06-09 08:57 Dedian | Views (841) | Comments (0)
 
Here is an article on effective I/O programming; I am bookmarking it so that I can later re-check the I/O design of my distributed search engine system. My current system uses non-blocking synchronous mode. I need to check later whether anything can be done to improve performance and scalability.


posted @ 2006-06-09 08:56 Dedian | Views (197) | Comments (0)
Copyright © Dedian