<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-stme-Essay Category-Search Engines</title><link>http://www.blogjava.net/stme/category/18708.html</link><description /><language>zh-cn</language><lastBuildDate>Wed, 28 Feb 2007 04:16:19 GMT</lastBuildDate><pubDate>Wed, 28 Feb 2007 04:16:19 GMT</pubDate><ttl>60</ttl><item><title>Nutch Related (3): Architecture of Nutch's Word Segmentation</title><link>http://www.blogjava.net/stme/archive/2007/01/07/92186.html</link><dc:creator>stme</dc:creator><author>stme</author><pubDate>Sun, 07 Jan 2007 02:23:00 GMT</pubDate><guid>http://www.blogjava.net/stme/archive/2007/01/07/92186.html</guid><wfw:comment>http://www.blogjava.net/stme/comments/92186.html</wfw:comment><comments>http://www.blogjava.net/stme/archive/2007/01/07/92186.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.blogjava.net/stme/comments/commentRss/92186.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/stme/services/trackbacks/92186.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; Summary: At its lowest level, Nutch's word segmentation relies on Lucene's abstract class Analyzer, which lives in the org.apache.lucene.analysis package. NutchAnalyzer extends Analyzer and implements the Configurable and Pluggable interfaces. The abstract class defines a public abstract method, tokenStream(String fieldName, Reader reader), whose return type is TokenStream.&nbsp;&nbsp;<a href='http://www.blogjava.net/stme/archive/2007/01/07/92186.html'>Read more</a><img src ="http://www.blogjava.net/stme/aggbug/92186.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/stme/" target="_blank">stme</a> 2007-01-07 10:23 <a href="http://www.blogjava.net/stme/archive/2007/01/07/92186.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Nutch Related 
(2): Word Segmentation Algorithm</title><link>http://www.blogjava.net/stme/archive/2007/01/05/90111.html</link><dc:creator>stme</dc:creator><author>stme</author><pubDate>Fri, 05 Jan 2007 03:45:00 GMT</pubDate><guid>http://www.blogjava.net/stme/archive/2007/01/05/90111.html</guid><wfw:comment>http://www.blogjava.net/stme/comments/90111.html</wfw:comment><comments>http://www.blogjava.net/stme/archive/2007/01/05/90111.html#Feedback</comments><slash:comments>4</slash:comments><wfw:commentRss>http://www.blogjava.net/stme/comments/commentRss/90111.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/stme/services/trackbacks/90111.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; Summary: On using Chinese word segmentation in Nutch: the segmentation program is fast, and no single article is long enough to keep it busy for long. Segmentation requests for individual articles can therefore be treated as a large number of short-lived thread requests. A thread pool is a good fit for this workload, because it greatly reduces the number of thread creations and destructions and improves the program's efficiency.&nbsp;&nbsp;<a href='http://www.blogjava.net/stme/archive/2007/01/05/90111.html'>Read more</a><img src ="http://www.blogjava.net/stme/aggbug/90111.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/stme/" target="_blank">stme</a> 2007-01-05 11:45 <a href="http://www.blogjava.net/stme/archive/2007/01/05/90111.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Nutch Related (1): A Study of the Crawler</title><link>http://www.blogjava.net/stme/archive/2007/01/04/91788.html</link><dc:creator>stme</dc:creator><author>stme</author><pubDate>Thu, 04 Jan 2007 09:18:00 
GMT</pubDate><guid>http://www.blogjava.net/stme/archive/2007/01/04/91788.html</guid><wfw:comment>http://www.blogjava.net/stme/comments/91788.html</wfw:comment><comments>http://www.blogjava.net/stme/archive/2007/01/04/91788.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.blogjava.net/stme/comments/commentRss/91788.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/stme/services/trackbacks/91788.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; Summary: Nutch supports plugin extensions, so it can meet the specific needs of different user groups, for example building a vertical search engine that collects a particular kind of information.&nbsp;&nbsp;<a href='http://www.blogjava.net/stme/archive/2007/01/04/91788.html'>Read more</a><img src ="http://www.blogjava.net/stme/aggbug/91788.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/stme/" target="_blank">stme</a> 2007-01-04 17:18 <a href="http://www.blogjava.net/stme/archive/2007/01/04/91788.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>