[Backup from blueline] WebLech reading notes 20050501 (URL classification, link handling, reading resources, multi-level directories, variable doc comments, Log4j)

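URLs are kept in three separate Lists.
One is boring: the URLs in this list are downloaded last.
The other two are interesting and average.
When a URL is found, it is checked against the words configured as boring; if it contains one, it goes into the boring list.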
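
The three-list scheme could be sketched roughly like this. It is only an illustration, not WebLech's own code: the class name UrlClassifier, the keyword lists, and the rule for interesting (a keyword match, mirroring the boring check) are all assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: sorts discovered URLs into three priority lists.
public class UrlClassifier {
    private final List<String> interesting = new ArrayList<>();
    private final List<String> average = new ArrayList<>();
    private final List<String> boring = new ArrayList<>();

    private final List<String> interestingWords; // assumed configuration
    private final List<String> boringWords;      // words configured as boring

    public UrlClassifier(List<String> interestingWords, List<String> boringWords) {
        this.interestingWords = interestingWords;
        this.boringWords = boringWords;
    }

    // Puts the URL into the list matching its priority (boring is checked first here).
    public void add(String url) {
        if (containsAny(url, boringWords)) {
            boring.add(url);
        } else if (containsAny(url, interestingWords)) {
            interesting.add(url);
        } else {
            average.add(url);
        }
    }

    // Next URL to download: interesting first, boring last; null when all lists are empty.
    public String next() {
        if (!interesting.isEmpty()) return interesting.remove(0);
        if (!average.isEmpty()) return average.remove(0);
        if (!boring.isEmpty()) return boring.remove(0);
        return null;
    }

    private static boolean containsAny(String url, List<String> words) {
        for (String word : words) {
            if (url.contains(word)) return true;
        }
        return false;
    }
}
```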

The user can choose "depth-first" search: every newly found URL is put at the very front of the list.
Breadth-first is also possible.
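
A small sketch of the two modes. The class name UrlQueue is made up, and appending to the end of the list for breadth-first is my assumption (the notes only describe the depth-first case).

```java
import java.util.LinkedList;

// Hypothetical sketch of depth-first vs. breadth-first URL scheduling.
public class UrlQueue {
    private final LinkedList<String> urls = new LinkedList<>();
    private final boolean depthFirst;

    public UrlQueue(boolean depthFirst) {
        this.depthFirst = depthFirst;
    }

    // Depth-first: a new URL jumps to the front of the list.
    // Breadth-first (assumed): a new URL waits at the back.
    public void enqueue(String url) {
        if (depthFirst) {
            urls.addFirst(url);
        } else {
            urls.addLast(url);
        }
    }

    // Always takes from the front; returns null when nothing is queued.
    public String next() {
        return urls.pollFirst();
    }
}
```

With depth-first, the URL found most recently is downloaded next, so the spider keeps following the newest links; with breadth-first, URLs are downloaded in the order they were found.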

Some page links need special handling:

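// Escape '?' and '&' in the link (they become "%3F" and "%26").
// Note: the one-argument URLEncoder.encode(String) is deprecated.
url = textReplace("?", URLEncoder.encode("?"), url);
url = textReplace("&", URLEncoder.encode("&"), url);

// Replaces every occurrence of find in input with replace.
private String textReplace(String find, String replace, String input)
{
    int startPos = 0;
    while(true)
    {
        int textPos = input.indexOf(find, startPos);
        if(textPos < 0)
        {
            break;
        }
        input = input.substring(0, textPos) + replace + input.substring(textPos + find.length());
        // Continue after the inserted text so the replacement itself is never rescanned.
        startPos = textPos + replace.length();
    }
    return input;
}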
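
To see what the two textReplace calls produce, here is a standalone sketch. The class name LinkEscaper and the sample URL are mine, and I use String.replace (Java 5+) as a shorter stand-in for textReplace, since it also does a literal replacement of every occurrence; "%3F" and "%26" are the percent-encodings of '?' and '&'.

```java
// Hypothetical sketch: same effect as the two textReplace calls above.
public class LinkEscaper {
    // String.replace is literal (no regex) and replaces all occurrences.
    public static String escape(String url) {
        return url.replace("?", "%3F").replace("&", "%26");
    }

    public static void main(String[] args) {
        System.out.println(escape("http://example.com/list?page=2&sort=asc"));
        // prints http://example.com/list%3Fpage=2%26sort=asc
    }
}
```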

Code for reading a resource:

BufferedInputStream remoteBIS = new BufferedInputStream(conn.getInputStream());
ByteArrayOutputStream baos = new ByteArrayOutputStream(10240);
byte[] buf = new byte[1024];
int bytesRead;
// read() returns the number of bytes read, or -1 at end of stream
while((bytesRead = remoteBIS.read(buf)) >= 0)
{
    baos.write(buf, 0, bytesRead);
}

byte[] content = baos.toByteArray();
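
To try the read loop without a network connection, it can be wrapped in a helper that takes any InputStream. A minimal sketch; the class and method names (StreamReader, readAll) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper around the read loop above.
public class StreamReader {
    // Reads the stream to its end and returns all bytes; the caller closes the stream.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(10240);
        byte[] buf = new byte[1024];
        int bytesRead;
        while ((bytesRead = in.read(buf)) != -1) {
            baos.write(buf, 0, bytesRead);
        }
        return baos.toByteArray();
    }
}
```

It can then be fed a ByteArrayInputStream in a test, or conn.getInputStream() in the spider.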

Creating multi-level directories:

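File f = new File(fileName);
File parent = f.getParentFile(); // null when fileName has no directory part
if(parent != null)
{
    parent.mkdirs(); // creates every missing directory on the path
}
FileOutputStream out = new FileOutputStream(f);
out.write(content);
out.flush();
out.close();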
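
On Java 7 and later the same steps can also be written with java.nio.file. This is an alternative sketch, not what WebLech does; the class name FileSaver is made up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: save bytes to a file, creating parent directories as needed.
public class FileSaver {
    public static void save(String fileName, byte[] content) throws IOException {
        Path target = Paths.get(fileName);
        Path parent = target.getParent(); // null when fileName has no directory part
        if (parent != null) {
            Files.createDirectories(parent); // no error if the directories already exist
        }
        Files.write(target, content); // creates or truncates the file, then closes it
    }
}
```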

Writing a doc comment for a variable (in Eclipse, hovering the mouse over the variable shows it):

/**
 * Set of URLs downloaded or scheduled, so we don't download a
 * URL more than once.
 * Thread safety: To access the set, first synchronize on it.
 */
private Set urlsDownloadedOrScheduled;
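
The "synchronize on it" rule from the comment could be followed like this. A minimal sketch: the class DownloadRegistry and the method scheduleIfNew are hypothetical, and I use generics here although the original field is a raw Set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of accessing the set under its own lock.
public class DownloadRegistry {
    /**
     * Set of URLs downloaded or scheduled, so we don't download a
     * URL more than once.
     * Thread safety: To access the set, first synchronize on it.
     */
    private final Set<String> urlsDownloadedOrScheduled = new HashSet<>();

    // Returns true only for the first caller that schedules this URL.
    public boolean scheduleIfNew(String url) {
        synchronized (urlsDownloadedOrScheduled) {
            // Set.add returns false when the element was already present.
            return urlsDownloadedOrScheduled.add(url);
        }
    }
}
```

Doing the check and the add inside one synchronized block means two threads cannot both decide that the same URL is new.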

This style of logging is quite nice (Apache log4j):

private final static Category _logClass = Category.getInstance(TextSpider.class);

/*
Output: 2005-05-01 11:40:44,250 [main] INFO  TextSpider.java:105 - Starting Spider...
*/
_logClass.info("Starting Spider...");

posted on 2006-02-16 14:10 by 罗明 | Category: Java
