<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-anchor110-Article category-Lucene, Solr, and related search technologies</title><link>http://www.blogjava.net/anchor110/category/50046.html</link><description /><language>zh-cn</language><lastBuildDate>Wed, 22 Aug 2012 11:22:21 GMT</lastBuildDate><pubDate>Wed, 22 Aug 2012 11:22:21 GMT</pubDate><ttl>60</ttl><item><title>Two ways to get the hyperlinks of a web page with htmlparser</title><link>http://www.blogjava.net/anchor110/articles/386059.html</link><dc:creator>小一败涂地</dc:creator><author>小一败涂地</author><pubDate>Wed, 22 Aug 2012 10:33:00 GMT</pubDate><guid>http://www.blogjava.net/anchor110/articles/386059.html</guid><wfw:comment>http://www.blogjava.net/anchor110/comments/386059.html</wfw:comment><comments>http://www.blogjava.net/anchor110/articles/386059.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/anchor110/comments/commentRss/386059.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/anchor110/services/trackbacks/386059.html</trackback:ping><description><![CDATA[Method 1 (filter the parsed page for a tags; the result is a NodeList of tag nodes):<br /><div style="background-color: #eeeeee; font-size: 13px; border: 1px solid #cccccc; padding: 4px 5px 4px 4px; width: 98%;"><span style="color: #000000; ">URL&nbsp;url&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;</span><span style="color: #0000FF; ">new</span><span style="color: #000000; ">&nbsp;URL(pageUrl);<br />URLConnection&nbsp;conn&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;url.openConnection();<br />Parser&nbsp;parser&nbsp;</span><span style="color: #000000; ">=</span><span style="color: 
#000000; ">&nbsp;</span><span style="color: #0000FF; ">new</span><span style="color: #000000; ">&nbsp;Parser(conn);<br />NodeList&nbsp;list&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;parser.parse(</span><span style="color: #0000FF; ">new</span><span style="color: #000000; ">&nbsp;TagNameFilter(</span><span style="color: #000000; ">"</span><span style="color: #000000; ">a</span><span style="color: #000000; ">"</span><span style="color: #000000; ">));</span></div><br />Method 2 (let HTMLLinkBean collect the links; the result is a URL array):<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><span style="color: #000000; ">HTMLLinkBean&nbsp;htmlLinkBean&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;</span><span style="color: #0000FF; ">new</span><span style="color: #000000; ">&nbsp;HTMLLinkBean();<br />htmlLinkBean.setURL(</span><span style="color: #000000; ">"</span><span style="color: #000000; ">http://www.sohu.com</span><span style="color: #000000; ">"</span><span style="color: #000000; ">);<br />URL[]&nbsp;urls&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;htmlLinkBean.getLinks();<br /></span></div><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/anchor110/" target="_blank">小一败涂地</a> 2012-08-22 18:33 <a href="http://www.blogjava.net/anchor110/articles/386059.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>org.htmlparser.util.EncodingChangeException: character 
mismatch error and how to fix it</title><link>http://www.blogjava.net/anchor110/articles/386057.html</link><dc:creator>小一败涂地</dc:creator><author>小一败涂地</author><pubDate>Wed, 22 Aug 2012 10:04:00 GMT</pubDate><guid>http://www.blogjava.net/anchor110/articles/386057.html</guid><wfw:comment>http://www.blogjava.net/anchor110/comments/386057.html</wfw:comment><comments>http://www.blogjava.net/anchor110/articles/386057.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/anchor110/comments/commentRss/386057.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/anchor110/services/trackbacks/386057.html</trackback:ping><description><![CDATA[Scenario:<br />The project uses htmlparser to extract the hyperlinks from a web page, with the following code:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><span style="color: #000000; ">URL&nbsp;url&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;</span><span style="color: #0000FF; ">new</span><span style="color: #000000; ">&nbsp;URL(pageUrl);<br />URLConnection&nbsp;conn&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;url.openConnection();<br />Parser&nbsp;parser&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;</span><span style="color: #0000FF; ">new</span><span style="color: #000000; ">&nbsp;Parser(conn);<br />NodeList&nbsp;list&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;parser.parse(</span><span style="color: #0000FF; ">new</span><span style="color: #000000; ">&nbsp;TagNameFilter(</span><span style="color: #000000; ">"</span><span style="color: #000000; ">a</span><span style="color: #000000; ">"</span><span style="color: #000000; ">));</span></div>Running it with pageUrl="http://tv.sohu.com/movie/" fails with:<br /><div>org.htmlparser.util.EncodingChangeException: character mismatch<br /><br />Fix:<br />Modify org.htmlparser.tags.MetaTag.java in htmlparser.jar as follows:<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><span style="color: #0000FF; ">public</span><span style="color: #000000; ">&nbsp;</span><span style="color: #0000FF; ">void</span><span style="color: #000000; ">&nbsp;doSemanticAction()&nbsp;</span><span style="color: #0000FF; ">throws</span><span style="color: #000000; ">&nbsp;ParserException&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;httpEquiv;<br />&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;charset;<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;httpEquiv&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;getHttpEquiv();<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000FF; ">if</span><span style="color: #000000; ">&nbsp;(</span><span style="color: #000000; ">"</span><span style="color: #000000; ">Content-Type</span><span style="color: #000000; ">"</span><span style="color: #000000; ">.equalsIgnoreCase(httpEquiv))&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Apply the meta charset only while the page still uses the default charset.<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000FF; ">if</span><span style="color: #000000; ">&nbsp;(Page.DEFAULT_CHARSET.equals(getPage().getEncoding()))&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;charset&nbsp;</span><span style="color: #000000; ">=</span><span style="color: #000000; ">&nbsp;getPage().getCharset(getAttribute(</span><span style="color: #000000; ">"</span><span style="color: #000000; ">CONTENT</span><span style="color: #000000; ">"</span><span style="color: #000000; ">));<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;getPage().setEncoding(charset);<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />}</span></div>With this guard, the charset declared in the page's meta tag is applied only while the page is still on the default charset, so an encoding already taken from the HTTP response is not switched partway through parsing.<br />Run it again and the problem is gone.</div><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/anchor110/" target="_blank">小一败涂地</a> 2012-08-22 18:04 <a href="http://www.blogjava.net/anchor110/articles/386057.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item>
<item><title>Solr tokenization: Chinese support problem</title><link>http://www.blogjava.net/anchor110/articles/363201.html</link><dc:creator>小一败涂地</dc:creator><author>小一败涂地</author><pubDate>Tue, 08 Nov 2011 10:29:00 GMT</pubDate><guid>http://www.blogjava.net/anchor110/articles/363201.html</guid><wfw:comment>http://www.blogjava.net/anchor110/comments/363201.html</wfw:comment><comments>http://www.blogjava.net/anchor110/articles/363201.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/anchor110/comments/commentRss/363201.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/anchor110/services/trackbacks/363201.html</trackback:ping><description><![CDATA[After setting up Solr, I wrote a test program and searched for Chinese text, but no query returned any results.<br /><br />It turned out to be caused by the Tomcat server's character encoding. Tomcat versions of that time decode request URIs as ISO-8859-1 by default, so UTF-8 encoded Chinese query terms are garbled before they reach Solr.<br />The fix is to set the character encoding on the Connector in conf/server.xml:<br /><div>&lt;Connector connectionTimeout="20000" port="8088" protocol="HTTP/1.1" redirectPort="8443" URIEncoding="utf-8" /&gt;</div>Note: the attribute name URIEncoding is case-sensitive; if the capitalization is wrong, the setting has no effect.<br><br><div 
align=right><a style="text-decoration:none;" href="http://www.blogjava.net/anchor110/" target="_blank">小一败涂地</a> 2011-11-08 18:29 <a href="http://www.blogjava.net/anchor110/articles/363201.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item>
<item><title>Integrating Solr with the paoding tokenizer</title><link>http://www.blogjava.net/anchor110/articles/362862.html</link><dc:creator>小一败涂地</dc:creator><author>小一败涂地</author><pubDate>Sat, 05 Nov 2011 13:59:00 GMT</pubDate><guid>http://www.blogjava.net/anchor110/articles/362862.html</guid><wfw:comment>http://www.blogjava.net/anchor110/comments/362862.html</wfw:comment><comments>http://www.blogjava.net/anchor110/articles/362862.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/anchor110/comments/commentRss/362862.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/anchor110/services/trackbacks/362862.html</trackback:ping><description><![CDATA[<div>Integration steps:<br />1. Download the paoding project from <div>http://paoding.googlecode.com/svn/trunk/paoding-analysis/</div>2. Add the environment variable <div>PAODING_DIC_HOME=D:\work\paoding-analysis\dic</div>where &#8220;D:\work\paoding-analysis\dic&#8221; is the location of the paoding project's dic directory.<br />3. Edit the paoding-dic-names.properties file under D:\work\paoding-analysis\dic and add: paoding.dic.home=D:\work\paoding-analysis\dic<br />4. Run build.bat under D:\work\paoding-analysis to generate a new paoding-analysis.jar.<br />5. Copy the paoding-analysis.jar generated in the previous step to solr.war/WEB-INF/lib.<br />6. Edit the configuration file in the Solr directory, for example E:\solr-tomcat\solr\conf\schema.xml, and add a fieldType that uses the paoding analyzer, as follows:<br /><div>&lt;fieldType name="text" class="solr.TextField"&gt;<br />&nbsp;&nbsp;&nbsp;&nbsp;&lt;analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"&gt;&lt;/analyzer&gt;<br />&lt;/fieldType&gt;</div>7. Restart the Solr service for the change to take effect. Open <div>http://localhost:8080/solr/admin/analysis.jsp</div>to test and verify the installation.<br /> 
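<br />As a quick sanity check for step 2, the following Java sketch verifies that PAODING_DIC_HOME is set and points at an existing directory before Solr is restarted. The class name PaodingDicCheck and its check method are hypothetical helpers introduced only for this illustration; they are not part of paoding or Solr, and the sketch assumes the variable is visible to the process that runs it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PaodingDicCheck {
    // Hypothetical helper: reports whether dicHome looks like a usable dictionary directory.
    static String check(String dicHome) {
        if (dicHome == null || dicHome.trim().isEmpty()) {
            return "PAODING_DIC_HOME is not set";
        }
        Path dir = Paths.get(dicHome);
        if (!Files.isDirectory(dir)) {
            return "not a directory: " + dicHome;
        }
        return "ok";
    }

    public static void main(String[] args) {
        // Reads the variable added in step 2 from the current process environment.
        System.out.println(check(System.getenv("PAODING_DIC_HOME")));
    }
}
```

If the program reports anything other than ok, it is probably worth fixing the environment variable before building the jar in step 4.<br />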
</div><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/anchor110/" target="_blank">小一败涂地</a> 2011-11-05 21:59 <a href="http://www.blogjava.net/anchor110/articles/362862.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>