<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava - Experience is not about years but about accumulation - Essay category - Spider</title><link>http://www.blogjava.net/hankchen/category/41473.html</link><description>Welcome to 陈新汉's personal blog</description><language>zh-cn</language><lastBuildDate>Tue, 22 Sep 2009 13:02:07 GMT</lastBuildDate><pubDate>Tue, 22 Sep 2009 13:02:07 GMT</pubDate><ttl>60</ttl><item><title>A Summary of Using the Webharvest Web Crawler</title><link>http://www.blogjava.net/hankchen/archive/2009/09/22/296000.html</link><dc:creator>陈新汉</dc:creator><author>陈新汉</author><pubDate>Tue, 22 Sep 2009 03:58:00 GMT</pubDate><guid>http://www.blogjava.net/hankchen/archive/2009/09/22/296000.html</guid><wfw:comment>http://www.blogjava.net/hankchen/comments/296000.html</wfw:comment><comments>http://www.blogjava.net/hankchen/archive/2009/09/22/296000.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.blogjava.net/hankchen/comments/commentRss/296000.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/hankchen/services/trackbacks/296000.html</trackback:ping><description><![CDATA[Summary: Web-Harvest is an open-source Java tool for extracting data from the Web. It collects specified Web pages and extracts useful data from them. It works by fetching the full content of each page with httpclient according to a predefined configuration file (httpclient is covered in earlier posts on this blog), then applying XPath, XQuery, regular expressions, and similar techniques to filter the text/xml content and select exactly the data needed. Vertical search, which was quite popular a couple of years ago (e.g. ...&nbsp;&nbsp;<a href='http://www.blogjava.net/hankchen/archive/2009/09/22/296000.html'>Read the full article</a><img src ="http://www.blogjava.net/hankchen/aggbug/296000.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/hankchen/" target="_blank">陈新汉</a> 2009-09-22 11:58 <a href="http://www.blogjava.net/hankchen/archive/2009/09/22/296000.html#Feedback" target="_blank" 
style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>