﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-刀剑笑-Post Category-SharpICTCLAS</title><link>http://www.blogjava.net/jiangyz/category/28462.html</link><description>Improve your life with technology</description><language>zh-cn</language><lastBuildDate>Fri, 28 Dec 2007 21:39:30 GMT</lastBuildDate><pubDate>Fri, 28 Dec 2007 21:39:30 GMT</pubDate><ttl>60</ttl><item><title>SharpICTCLAS 1.0 Released! (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171318.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:55:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171318.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171318.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171318.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171318.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171318.html</trackback:ping><description><![CDATA[<p><font color="#ff0000">SharpICTCLAS 1.0 released. (Thanks to <a href="http://www.gk-z.com/" target="_blank">工控网</a> for reporting a bug in the string comparison code. It has been fixed, so please download the package again. April 20, 2007)</font></p>
<ul>
    <li><a href="http://www.cnblogs.com/Files/zhenyulu/SharpICTCLAS分词系统_1.0.rar">Download SharpICTCLAS 1.0</a> </li>
</ul>
<p>　</p>
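<h3>I. Improvements in SharpICTCLAS 1.0 over the Beta</h3>
<p>1. Reworked the atomic segmentation code so that full-width letters are recognized properly.</p>
<p>2. Fixed part of the part-of-speech (POS) tagging code.</p>
<p>The POS tagging code had a bug (most likely inherited from ICTCLAS itself): if a Chinese character has no part of speech, tagging throws an exception. For example, in the sentence &#8220;这些是永远也没有现成的答桉的&#8221; the word &#8220;答案&#8221; is misspelled. When this sentence is segmented, the character &#8220;桉&#8221; has no part of speech, and the program failed at that point.</p>
<p>The current workaround is to tag words that have no part of speech as &#8220;字符串&#8221; (string) in the final tagging step.</p>
<p>3. Fixed several problems in place-name recognition.</p>
<p>The problem was in the PlaceRecognize method of the Span class, where nStart and nEnd were sometimes computed incorrectly. In the beta of SharpICTCLAS, segmenting the sentence &#8220;明定陵是明十三陵中第十座陵墓&#8221; raised an exception because of this. </p>
<p>4. Fixed the CCID-based string comparison code.</p>
<p>The original code did not properly handle comparison of strings that mix full-width and half-width characters. This has been corrected.</p>
<p>5. Fixed the code that adds words to the dictionary.</p>
<p>The original code contained errors, which have now been corrected.</p>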
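<p>To illustrate the full-width/half-width comparison issue mentioned above, here is a minimal sketch that normalizes both strings to half-width before comparing them. It is not the SharpICTCLAS implementation, which the text describes as CCID-based. The WidthNormalizer class and its ToHalfWidth and Compare methods are hypothetical names introduced only for this example.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: width-insensitive comparison (hypothetical)</div>
</div>
<div class="content">using System; <br />
using System.Text; <br />
<br />
public static class WidthNormalizer <br />
{ <br />
&nbsp;&nbsp; // Hypothetical helper: maps full-width ASCII variants (U+FF01..U+FF5E) <br />
&nbsp;&nbsp; // and the ideographic space (U+3000) to their half-width counterparts. <br />
&nbsp;&nbsp; public static string ToHalfWidth(string text) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; StringBuilder sb = new StringBuilder(text.Length); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (char c in text) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (c == '\u3000') <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append(' '); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else if (c &gt;= '\uFF01' &amp;&amp; c &lt;= '\uFF5E') <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append((char)(c - 0xFEE0)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append(c); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return sb.ToString(); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; // Compares two strings in ordinal order after width normalization. <br />
&nbsp;&nbsp; public static int Compare(string a, string b) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return string.CompareOrdinal(ToHalfWidth(a), ToHalfWidth(b)); <br />
&nbsp;&nbsp; } <br />
} <br />
<br />
class WidthDemo <br />
{ <br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // The first argument is written with full-width letters. <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(WidthNormalizer.Compare("ＡＢＣ", "ABC") == 0); <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
<p>Subtracting 0xFEE0 works because the full-width forms U+FF01 to U+FF5E are laid out in the same order as ASCII U+0021 to U+007E.</p>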
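<h3>II. Areas That Still Need Work</h3>
<p>The program still leaves a lot of room for improvement. For example, the atomic segmentation code does not yet recognize e-mail addresses, URLs and similar tokens very well; regular expressions could be used to improve this later. In addition, apart from fixing some faulty code, I have made no improvements or adjustments to the POS tagging code or to the person-name and place-name recognition code. As a result the code base as a whole is untidy and needs a thorough refactoring.</p>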
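<p>As a rough idea of how regular expressions could help here, the sketch below finds URLs and e-mail addresses in a sentence so that each could later be treated as a single atom. This is an assumption about one possible approach, not code from SharpICTCLAS. The class name AtomPatternDemo and both patterns are hypothetical and deliberately simple.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: spotting URLs and e-mail addresses (hypothetical)</div>
</div>
<div class="content">using System; <br />
using System.Text.RegularExpressions; <br />
<br />
class AtomPatternDemo <br />
{ <br />
&nbsp;&nbsp; // Deliberately simple patterns; real-world URL and e-mail syntax <br />
&nbsp;&nbsp; // is broader than what these match. <br />
&nbsp;&nbsp; public static readonly Regex UrlOrEmail = new Regex( <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @"(?&lt;url&gt;https?://[A-Za-z0-9\-._~:/?#@!$&amp;'()*+,;=%]+)" + <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @"|(?&lt;email&gt;[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,})"); <br />
<br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; string sentence = "请访问http://www.cnblogs.com/zhenyulu或发邮件到zhenyulu@163.com联系"; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (Match m in UrlOrEmail.Matches(sentence)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // The named group that succeeded tells us which kind of token matched. <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; string kind = m.Groups["url"].Success ? "URL" : "EMAIL"; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine("{0}: {1}", kind, m.Value); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
<p>Because neither character class contains Chinese characters, each match ends where the surrounding Chinese text resumes.</p>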
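<h3>III. Sample Code for Using SharpICTCLAS</h3>
<p>To make SharpICTCLAS easier to use, some sample code is provided. It covers: 1) adding new words to the dictionary; 2) preprocessing files, including converting Traditional Chinese to Simplified Chinese, converting full-width characters to half-width, filtering out superfluous HTML tags with regular expressions, and splitting text into sentences. For details, see my article &#8220;<a href="http://www.cnblogs.com/zhenyulu/articles/718375.html">Introduction to the SharpICTCLAS Word Segmentation System (9): Extending the Dictionary</a>&#8221;.</p>
<p>After these adjustments SharpICTCLAS performs reasonably well. In a test that segmented 15,000 articles from 博客园 (cnblogs), I first added more than 1,300 words to the dictionary and then ran the segmentation. Exceptions occurred 15 times in total: 9 were caused by large amounts of Japanese text, and the other 6 by sentences containing more words than the software limit (200 words). Speed was also acceptable, although still slow overall: the 15,000 articles took 2.5 hours in total. That figure covers more than segmentation. It also includes Traditional-to-Simplified conversion, stripping HTML markup with regular expressions, counting word frequencies (which requires detecting repeated words; I used an AVL tree and counted 160,000 words in total), and writing the segmentation results to a SQL Server 2005 database. Leaving those factors aside, my impression is that the speed should be close to that of the C++ program, though this has not been rigorously tested.</p>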
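<p>The word-frequency step described above can be sketched as follows. The original test used an AVL tree; this sketch swaps in the .NET SortedDictionary class, which is also a binary search tree with O(log n) lookups but is not an AVL tree. The WordFrequencyDemo class and the CountWords method are hypothetical names, and the sample words are made up for illustration.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: counting word frequencies (hypothetical)</div>
</div>
<div class="content">using System; <br />
using System.Collections.Generic; <br />
<br />
class WordFrequencyDemo <br />
{ <br />
&nbsp;&nbsp; // Counts how often each word occurs; keys are kept in ordinal order. <br />
&nbsp;&nbsp; public static SortedDictionary&lt;string, int&gt; CountWords(IEnumerable&lt;string&gt; words) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SortedDictionary&lt;string, int&gt; freq = <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new SortedDictionary&lt;string, int&gt;(StringComparer.Ordinal); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (string w in words) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int count; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; freq.TryGetValue(w, out count);&nbsp; // count is 0 when w is new <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; freq[w] = count + 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return freq; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; string[] words = { "分词", "词库", "分词", "词性", "分词" }; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (KeyValuePair&lt;string, int&gt; kv in CountWords(words)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine("{0}\t{1}", kv.Key, kv.Value); <br />
&nbsp;&nbsp; } <br />
}</div>
</div>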
<p>If you run into new problems while using it, please let me know promptly and I will keep fixing them.</p>
<p>　</p>
<hr align="left" width="400" />
<p>　</p>
<ul>
    <li><font color="#800080"><strong>About ICTCLAS:</strong></font> </li>
</ul>
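<p>ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is the Chinese lexical analysis system developed at the Institute of Computing Technology. Its functions are Chinese word segmentation, POS tagging, and unknown-word recognition. Segmentation accuracy reaches 97.58% (result of the 973 expert evaluation), recall for unknown-word recognition is above 90% throughout, with recall for Chinese person names close to 98%, and the processing speed is 31.5Kbytes/s.</p>
<p>Copyright: Copyright(c)2002-2005 Institute of Computing Technology, Chinese Academy of Sciences. Copyright holder (work made for hire): Zhang Huaping (张华平)</p>
<p>License: Natural Language Processing Open Resource License 1.0 (自然语言处理开放资源许可证1.0)</p>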
<p>Email: <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#122;&#104;&#97;&#110;&#103;&#104;&#112;&#64;&#115;&#111;&#102;&#116;&#119;&#97;&#114;&#101;&#46;&#105;&#99;&#116;&#46;&#97;&#99;&#46;&#99;&#110;">zhanghp@software.ict.ac.cn</a></p>
<p>Homepage: <a href="http://www.i3s.ac.cn/">http://www.i3s.ac.cn</a></p>
<p>　</p>
<ul>
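    <li><strong><font color="#800080">SharpICTCLAS:</font></strong> </li>
</ul>
<p>SharpICTCLAS is ICTCLAS for the .NET platform. It was adapted from the free version of ICTCLAS by Lü Zhenyu (吕震宇) of the School of Economics and Management, Hebei Polytechnic University (河北理工大学), who also rewrote and adjusted parts of the original code.</p>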
<p>Email: <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#122;&#104;&#101;&#110;&#121;&#117;&#108;&#117;&#64;&#49;&#54;&#51;&#46;&#99;&#111;&#109;">zhenyulu@163.com</a></p>
<p>Blog: <a href="http://www.cnblogs.com/zhenyulu">http://www.cnblogs.com/zhenyulu</a><br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>
  <img src ="http://www.blogjava.net/jiangyz/aggbug/171318.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:55 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171318.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (9): Extending the Dictionary (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171317.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:43:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171317.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171317.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171317.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171317.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171317.html</trackback:ping><description><![CDATA[<h3>1. Extending the SharpICTCLAS Dictionary</h3>
<p>If the current SharpICTCLAS dictionary does not meet your needs, you can extend it. Doing so is straightforward, as the following code shows:</p>
<div class="code">
<div class="title">
<div style="clear: none">Dictionary extension</div>
</div>
<div class="content"><span style="color: #0000ff">static</span> <span style="color: #0000ff">void</span> Main(<span style="color: #0000ff">string</span>[] args) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">string</span> DictPath = Path.Combine(Environment.CurrentDirectory, <span style="color: #ff00ff">"Data"</span>) +<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Path.DirectorySeparatorChar; <br />
&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"正在读入字典，请稍候..."</span>); <br />
<br />
&nbsp;&nbsp; WordDictionary dict = <span style="color: #0000ff">new</span> WordDictionary(); <br />
&nbsp;&nbsp; dict.Load(DictPath + <span style="color: #ff00ff">"coreDict.dct"</span>); <br />
<br />
&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n向字典库插入&#8220;设计模式&#8221;一词..."</span>); <br />
&nbsp;&nbsp; dict.AddItem(<span style="color: #ff00ff">"设计模式"</span>, Utility.GetPOSValue(<span style="color: #ff00ff">"n"</span>), 10); <br />
<br />
&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n修改完成，将字典写入磁盘文件coreDictNew.dct，请稍候..."</span>); <br />
&nbsp;&nbsp; dict.Save(DictPath + <span style="color: #ff00ff">"coreDictNew.dct"</span>); <br />
<br />
&nbsp;&nbsp; Console.Write(<span style="color: #ff00ff">"按下回车键退出......"</span>); <br />
&nbsp;&nbsp; Console.ReadLine(); <br />
}</div>
</div>
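<p>The AddItem method makes it easy to add a new word. Besides the word itself, you must also specify its part of speech and its frequency. In the code above, &#8220;设计模式&#8221; (design pattern) is added with the POS tag &#8220;n&#8221; and a frequency of 10, and the modified dictionary is saved as coreDictNew.dct.</p>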
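<h3>2. Other Tools</h3>
<p>The SharpICTCLAS sample code also includes a utility class, PreProcessUtility, for preprocessing files. It provides code for converting the Traditional Chinese characters in GB2312 to Simplified Chinese, a method for converting full-width letters to half-width, and methods for preprocessing HTML files by removing HTML tags. Use them as needed.</p>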
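<p>For readers who want a feel for the HTML preprocessing step, here is a minimal sketch of tag removal with regular expressions. It does not reproduce the PreProcessUtility API, whose method names are not shown in this article. The HtmlStripDemo class and the StripHtml method are hypothetical, and the approach is only adequate for simple, well-formed markup.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: removing HTML tags (hypothetical)</div>
</div>
<div class="content">using System; <br />
using System.Text.RegularExpressions; <br />
<br />
class HtmlStripDemo <br />
{ <br />
&nbsp;&nbsp; // Hypothetical helper, not part of PreProcessUtility. <br />
&nbsp;&nbsp; public static string StripHtml(string html) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // Drop script and style blocks together with their content. <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; string text = Regex.Replace(html, @"&lt;(script|style)\b[^&gt;]*&gt;.*?&lt;/\1&gt;", "", <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; RegexOptions.IgnoreCase | RegexOptions.Singleline); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // Remove every remaining tag but keep the text between tags. <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; text = Regex.Replace(text, @"&lt;[^&gt;]+&gt;", ""); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return text.Trim(); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(StripHtml("&lt;p&gt;中文&lt;b&gt;分词&lt;/b&gt;系统&lt;/p&gt;")); <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
<p>Character entities are left untouched in this sketch; decoding them would be a separate step.</p>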
<p>　</p>
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
<p>This concludes the series of articles on SharpICTCLAS.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>
 <img src ="http://www.blogjava.net/jiangyz/aggbug/171317.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:43 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171317.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (8): Miscellaneous (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171316.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:38:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171316.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171316.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171316.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171316.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171316.html</trackback:ping><description><![CDATA[<p>The previous articles covered the main parts of SharpICTCLAS. This article covers some remaining topics: the event mechanism and how to use SharpICTCLAS.</p>
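<h3>1. Events in SharpICTCLAS</h3>
<p>Segmentation is a fairly complex process, so users may well want to trace it, and setting breakpoints in the code is cumbersome. SharpICTCLAS therefore provides an event mechanism that raises events at the different stages of segmentation. Users can subscribe to these events and print the intermediate results for troubleshooting.</p>
<p>The stages are defined in the SegmentStage enumeration:</p>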
<div class="code">
<div class="title">
<div style="clear: none">SegmentStage enumeration</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">enum</span> SegmentStage <br />
{ <br />
&nbsp;&nbsp; BeginSegment,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// segmentation begins </span><br />
&nbsp;&nbsp; AtomSegment,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// atomic segmentation </span><br />
&nbsp;&nbsp; GenSegGraph,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// build SegGraph </span><br />
&nbsp;&nbsp; GenBiSegGraph,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// build BiSegGraph </span><br />
&nbsp;&nbsp; NShortPath,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// N-shortest-path computation </span><br />
&nbsp;&nbsp; BeforeOptimize,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// results after post-processing the N shortest paths </span><br />
&nbsp;&nbsp; OptimumSegment,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// initial OptimumSegmentGraph </span><br />
&nbsp;&nbsp; PersonAndPlaceRecognition, <span style="color: #008000">// OptimumSegmentGraph after person-name and place-name recognition </span><br />
&nbsp;&nbsp; BiOptimumSegment,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// build BiOptimumSegmentGraph </span><br />
&nbsp;&nbsp; FinishSegment&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// segmentation finished, output the result </span><br />
}</div>
</div>
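<p>These values correspond to the 10 stages of the segmentation process.</p>
<p>SharpICTCLAS also defines an EventArgs subclass with two members: one records the segmentation stage the event belongs to, and the other holds the intermediate result for that stage. The intermediate result is currently a string; a richer representation could be used in the future. The class is defined as follows:</p>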
<div class="code">
<div class="title">
<div style="clear: none">Definition of SegmentEventArgs</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> SegmentEventArgs : EventArgs <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> SegmentStage Stage; <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">string</span> Info = <span style="color: #ff00ff">""</span>; <br />
<br />
&nbsp;&nbsp; ...... <br />
}</div>
</div>
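<p>What remains is to define the delegate and raise the events. Segmentation is mainly carried out by two classes, WordSegment and Segment, and users normally deal only with WordSegment, so WordSegment forwards the events raised by Segment.</p>
<p>The delegate and the event are defined as follows (excerpt):</p>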
<div class="code">
<div class="title">
<div style="clear: none">Code</div>
</div>
<div class="content"><span style="color: #008000">//--- delegate definition </span><br />
<span style="color: #0000ff">public</span> <span style="color: #0000ff">delegate</span> <span style="color: #0000ff">void</span> SegmentEventHandler(<span style="color: #0000ff">object</span> sender, SegmentEventArgs e); <br />
<br />
<span style="color: #008000">//--- event definition </span><br />
<span style="color: #0000ff">public</span> <span style="color: #0000ff">event</span> SegmentEventHandler OnSegmentEvent; <br />
<br />
<span style="color: #008000">//--- method that raises the event </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> SendEvents(SegmentEventArgs e) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">if</span> (OnSegmentEvent != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; OnSegmentEvent(<span style="color: #0000ff">this</span>, e); <br />
} <br />
<br />
<span style="color: #008000">//--- segmentation begins </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> OnBeginSegment(<span style="color: #0000ff">string</span> sentence) <br />
{ <br />
&nbsp;&nbsp; SendEvents(<span style="color: #0000ff">new</span> SegmentEventArgs(SegmentStage.BeginSegment, sentence)); <br />
} <br />
<br />
...... <br />
<br />
<span style="color: #008000">//--- segmentation finished </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> OnFinishSegment(List&lt;WordResult[]&gt; m_pWordSeg) <br />
{ <br />
&nbsp;&nbsp; StringBuilder sb = <span style="color: #0000ff">new</span> StringBuilder(); <br />
&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> k = 0; k &lt; m_pWordSeg.Count; k++) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> j = 0; j &lt; m_pWordSeg[k].Length; j++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append(<span style="color: #0000ff">string</span>.Format(<span style="color: #ff00ff">"{0} /{1} "</span>, m_pWordSeg[k][j].sWord,&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Utility.GetPOSString(m_pWordSeg[k][j].nPOS))); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append(<span style="color: #ff00ff">"\r\n"</span>); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; SendEvents(<span style="color: #0000ff">new</span> SegmentEventArgs(SegmentStage.FinishSegment, sb.ToString())); <br />
}</div>
</div>
<p>With these events, users can subscribe as needed to obtain the intermediate segmentation results, which makes debugging much easier.</p>
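<p>The forwarding arrangement described above, where an outer class re-raises the events of an inner class, can be sketched in a self-contained form. All names here (Stage, StageEventArgs, InnerWorker, OuterFacade) are hypothetical stand-ins rather than SharpICTCLAS types, and the two-stage enum is a simplification of SegmentStage.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: forwarding events from an inner class (hypothetical)</div>
</div>
<div class="content">using System; <br />
<br />
public enum Stage { Begin, Finish } <br />
<br />
public class StageEventArgs : EventArgs <br />
{ <br />
&nbsp;&nbsp; public readonly Stage Stage; <br />
&nbsp;&nbsp; public readonly string Info; <br />
&nbsp;&nbsp; public StageEventArgs(Stage stage, string info) { Stage = stage; Info = info; } <br />
} <br />
<br />
public delegate void StageEventHandler(object sender, StageEventArgs e); <br />
<br />
// Plays the role of the inner Segment class: it raises the events. <br />
public class InnerWorker <br />
{ <br />
&nbsp;&nbsp; public event StageEventHandler OnStageEvent; <br />
<br />
&nbsp;&nbsp; public void Run(string sentence) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Raise(new StageEventArgs(Stage.Begin, sentence)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Raise(new StageEventArgs(Stage.Finish, sentence.ToUpperInvariant())); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; private void Raise(StageEventArgs e) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; StageEventHandler handler = OnStageEvent;&nbsp; // local copy guards against unsubscribe races <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (handler != null) handler(this, e); <br />
&nbsp;&nbsp; } <br />
} <br />
<br />
// Plays the role of WordSegment: it subscribes to the inner object and re-raises. <br />
public class OuterFacade <br />
{ <br />
&nbsp;&nbsp; private readonly InnerWorker inner = new InnerWorker(); <br />
&nbsp;&nbsp; public event StageEventHandler OnStageEvent; <br />
<br />
&nbsp;&nbsp; public OuterFacade() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inner.OnStageEvent += delegate(object sender, StageEventArgs e) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; StageEventHandler handler = OnStageEvent; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (handler != null) handler(this, e);&nbsp; // the facade becomes the sender <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; public void Run(string sentence) { inner.Run(sentence); } <br />
} <br />
<br />
class ForwardingDemo <br />
{ <br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; OuterFacade facade = new OuterFacade(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; facade.OnStageEvent += delegate(object sender, StageEventArgs e) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine("{0}: {1}", e.Stage, e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; facade.Run("abc"); <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
<p>Callers only ever see the facade, so the inner class can change without affecting subscribers.</p>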
<h3>2. Using SharpICTCLAS</h3>
<p>The following sample code shows how to use SharpICTCLAS:</p>
<div class="code">
<div class="title">
<div style="clear: none">WordSegmentSample.cs</div>
</div>
<div class="content"><span style="color: #0000ff">using</span> System; <br />
<span style="color: #0000ff">using</span> System.Collections.Generic; <br />
<span style="color: #0000ff">using</span> System.Text; <br />
<strong><span style="color: #0000ff">using</span> SharpICTCLAS; </strong><br />
<br />
<span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> WordSegmentSample <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">private</span> <span style="color: #0000ff">int</span> nKind = 1;&nbsp; <span style="color: #008000">// number of candidate results NShortPath produces in the initial segmentation </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">private</span> WordSegment wordSegment; <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Constructor; nKind defaults to 1 when not specified </span><br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> WordSegmentSample(<span style="color: #0000ff">string</span> dictPath) : <span style="color: #0000ff">this</span>(dictPath, 1) { } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Constructor </span><br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> WordSegmentSample(<span style="color: #0000ff">string</span> dictPath, <span style="color: #0000ff">int</span> nKind) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">this</span>.nKind = nKind; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong><span style="color: #0000ff">this</span>.wordSegment = <span style="color: #0000ff">new</span> WordSegment(); </strong><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//---------- subscribe to the segmentation events ---------- </span><br />
<strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wordSegment.OnSegmentEvent += <span style="color: #0000ff">new</span> SegmentEventHandler(<span style="color: #0000ff">this</span>.OnSegmentEventHandler); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wordSegment.InitWordSegment(dictPath); </strong><br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Run segmentation </span><br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> List&lt;WordResult[]&gt; Segment(<span style="color: #0000ff">string</span> sentence) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong><span style="color: #0000ff">return</span> wordSegment.Segment(sentence, nKind); </strong><br />
&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Print the intermediate result of every segmentation stage </span><br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> OnSegmentEventHandler(<span style="color: #0000ff">object</span> sender, SegmentEventArgs e) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">switch</span> (e.Stage) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.BeginSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 原始句子：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info + <span style="color: #ff00ff">"\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.AtomSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 原子切分：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.GenSegGraph: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 生成 segGraph：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.GenBiSegGraph: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 生成 biSegGraph：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.NShortPath: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== NShortPath 初步切分的到的 N 个结果：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.BeforeOptimize: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 经过数字、日期合并等策略处理后的 N 个结果：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.OptimumSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 将 N 个结果归并入OptimumSegment：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.PersonAndPlaceRecognition: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 加入对姓名、翻译人名以及地名的识别：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.BiOptimumSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 对加入对姓名、地名的OptimumSegment生成BiOptimumSegment：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.FinishSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 最终识别结果：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp; } <br />
} <br />
</div>
</div>
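<p>As the code shows, you first import the SharpICTCLAS namespace and then create an instance of the WordSegment class. To intercept events during segmentation, subscribe to the OnSegmentEvent event of WordSegment. The code above subscribes with the OnSegmentEventHandler method and prints the intermediate results of every segmentation stage.</p>
<p>The nKind field of WordSegmentSample determines how many candidate results the NShortPath method produces in the initial segmentation. Unless specified otherwise, nKind is 1. Users can also choose an integer between 1 and 10 (values above 10 are treated as 10). The larger the value, the higher the segmentation accuracy (see Zhang Huaping's paper), but the slower the system runs.</p>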
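<p>The range rule for nKind can be written as a one-line clamp. This is only an illustration of the rule as stated, not the library code. The NKindDemo class and the ClampKind method are hypothetical, and treating values below 1 as 1 is an assumption, since the text only says that values above 10 are capped.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: clamping nKind (hypothetical)</div>
</div>
<div class="content">using System; <br />
<br />
class NKindDemo <br />
{ <br />
&nbsp;&nbsp; // Capping at 10 follows the text; the lower bound of 1 is an assumption. <br />
&nbsp;&nbsp; public static int ClampKind(int nKind) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return Math.Max(1, Math.Min(nKind, 10)); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int[] requested = { 0, 1, 5, 10, 25 }; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (int n in requested) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine("requested {0} -&gt; used {1}", n, ClampKind(n)); <br />
&nbsp;&nbsp; } <br />
}</div>
</div>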
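<p>The InitWordSegment method of WordSegment initializes the dictionaries. The user supplies the directory that contains the dictionaries, and the system automatically searches that directory for all of them.</p>
<p>With the WordSegmentSample class in place, the main program looks like this:</p>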
<div class="code">
<div class="title">
<div style="clear: none">Main program</div>
</div>
<div class="content"><span style="color: #0000ff">using</span> System; <br />
<span style="color: #0000ff">using</span> System.Collections.Generic; <br />
<span style="color: #0000ff">using</span> System.Text; <br />
<span style="color: #0000ff">using</span> System.IO; <br />
<span style="color: #0000ff">using</span> SharpICTCLAS; <br />
<br />
<span style="color: #0000ff">class</span> Program <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">static</span> <span style="color: #0000ff">void</span> Main(<span style="color: #0000ff">string</span>[] args) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; List&lt;WordResult[]&gt; result; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">string</span> DictPath = Path.Combine(Environment.CurrentDirectory, <span style="color: #ff00ff">"Data"</span>) +&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Path.DirectorySeparatorChar; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"正在初始化字典库，请稍候..."</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WordSegmentSample sample = <span style="color: #0000ff">new</span> WordSegmentSample(DictPath, 5); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; result = sample.Segment(@<span style="color: #ff00ff">"王晓平在1月份滦南大会上说的确实在理"</span>); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//---------- print the result ---------- </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Console.WriteLine("\r\n==== 最终识别结果：\r\n"); </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//for (int i = 0; i &lt; result.Count; i++) </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//{ </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//&nbsp;&nbsp; for (int j = 0; j &lt; result[i].Length; j++) </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.Write("{0} /{1} ", result[i][j].sWord, Utility.GetPOSString(result[i][j].nPOS)); </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//&nbsp;&nbsp; Console.WriteLine(); </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//} </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.Write(<span style="color: #ff00ff">"按下回车键退出......"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.ReadLine(); <br />
&nbsp;&nbsp; } <br />
} <br />
</div>
</div>
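<p>The code is simple and needs little explanation. Because WordSegmentSample subscribes to the events of every stage, the program prints the intermediate results of each stage, including the final segmentation result, which is why the result-printing code above is commented out. If you do not subscribe to any events, you can use the commented-out code to print the final result.</p>
<p>The program produces the following output:</p>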
<div class="code">
<div class="title">
<div style="clear: none">Output of the WordSegmentSample program</div>
</div>
<div class="content">正在初始化字典库，请稍候... <br />
<br />
<span style="color: #008000">//==== 原始句子： </span><br />
<br />
王晓平在1月份滦南大会上说的确实在理 <br />
<br />
<br />
<span style="color: #008000">//==== 原子切分： </span><br />
<br />
始##始, 王, 晓, 平, 在, 1, 月, 份, 滦, 南, 大, 会, 上, 说, 的, 确, 实, 在, 理, 末##末,&nbsp; <br />
<br />
<br />
<span style="color: #008000">//==== 生成 segGraph： </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 218.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 271.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp; 1900.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:月 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 1234.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.00,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 813.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp; 14536.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp; 1333.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会 <br />
row: 11,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp; 6136.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 469.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp; 23706.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:上 <br />
row: 13,&nbsp; col: 14,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 14,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 210.00,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确 <br />
row: 15,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 181.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row: 16,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 357.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实 <br />
row: 16,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 295.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实在 <br />
row: 17,&nbsp; col: 18,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row: 18,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row: 19,&nbsp; col: 20,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
<br />
<br />
<span style="color: #008000">//==== Generate biSegGraph: </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.18,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始@王 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.46,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王@晓 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.93,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓@平 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.25,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平@在 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.74,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@未##数 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight: -27898.79,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@月 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight: -27898.75,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@月份 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.33,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:月@份 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.83,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份@滦 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@滦 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.46,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦@南 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.19,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南@大 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.17,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南@大会 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会 <br />
row: 11,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会上 <br />
row: 12,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.11,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会@上 <br />
row: 13,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 8.16,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会@上 <br />
row: 14,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.42,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上@说 <br />
row: 15,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.07,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:上@说 <br />
row: 16,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.05,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的 <br />
row: 16,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.11,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的确 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确 <br />
row: 17,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确实 <br />
row: 18,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.49,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确@实 <br />
row: 19,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.63,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确@实 <br />
row: 18,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.49,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确@实在 <br />
row: 19,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.63,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确@实在 <br />
row: 20,&nbsp; col: 23,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.92,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在 <br />
row: 21,&nbsp; col: 23,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.98,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实@在 <br />
row: 20,&nbsp; col: 24,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.97,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在理 <br />
row: 21,&nbsp; col: 24,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.98,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实@在理 <br />
row: 22,&nbsp; col: 25,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.17,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实在@理 <br />
row: 23,&nbsp; col: 25,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.62,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@理 <br />
row: 24,&nbsp; col: 26,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.30,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理@末##末 <br />
row: 25,&nbsp; col: 26,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.95,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理@末##末 <br />
<br />
<br />
<span style="color: #008000">//==== N results from the initial NShortPath segmentation: </span><br />
<br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大会, 上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大, 会上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大会, 上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月, 份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
<br />
<br />
<span style="color: #008000">//==== N results after applying number/date merging and related strategies: </span><br />
<br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大会, 上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大会, 上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月, 份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
<br />
<br />
<span style="color: #008000">//==== After adding recognition of person names, transliterated names, and place names: </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 218.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.86,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.27,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 271.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 1234.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.00,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦 <br />
row:&nbsp; 8,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 20.37,&nbsp;&nbsp; nPOS: -28275,&nbsp;&nbsp; sWord:未##地 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 813.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp; 14536.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp; 1333.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 469.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp; 23706.00,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数 <br />
row: 13,&nbsp; col: 14,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row: 17,&nbsp; col: 18,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row: 18,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row: 19,&nbsp; col: 20,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
<br />
<br />
<span style="color: #008000">//==== Generate biSegGraph: </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.18,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始@王 <br />
row:&nbsp; 0,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.88,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始@未##人 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.46,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王@晓 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.88,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王@未##人 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.93,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓@平 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 6,&nbsp; eWeight: -28270.43,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人@在 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 6,&nbsp; eWeight: -28270.43,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人@在 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.25,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平@在 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.01,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@未##时 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.01,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@未##时 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight: -29690.16,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时@份 <br />
row:&nbsp; 8,&nbsp; col: 10,&nbsp; eWeight: -29690.16,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时@滦 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@滦 <br />
row:&nbsp; 8,&nbsp; col: 11,&nbsp; eWeight: -29690.17,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时@未##地 <br />
row:&nbsp; 9,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@未##地 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.46,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦@南 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight: -28267.95,&nbsp;&nbsp; nPOS: -28275,&nbsp;&nbsp; sWord:未##地@大 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.19,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南@大 <br />
row: 11,&nbsp; col: 14,&nbsp; eWeight: -28266.85,&nbsp;&nbsp; nPOS: -28275,&nbsp;&nbsp; sWord:未##地@大会 <br />
row: 12,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.17,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南@大会 <br />
row: 13,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会上 <br />
row: 14,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.81,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会@未##数 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.42,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上@说 <br />
row: 16,&nbsp; col: 17,&nbsp; eWeight: -27898.75,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@说 <br />
row: 17,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.05,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的 <br />
row: 18,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确实 <br />
row: 19,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.92,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在 <br />
row: 19,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.97,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在理 <br />
row: 20,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.62,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@理 <br />
row: 21,&nbsp; col: 23,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.30,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理@末##末 <br />
row: 22,&nbsp; col: 23,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.95,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理@末##末 <br />
<br />
<br />
<span style="color: #008000">//==== Final recognition result: </span><br />
<br />
王晓平 /nr 在 /p&nbsp; 1月份 /t&nbsp; 滦南 /ns 大会 /n&nbsp; 上 /v&nbsp; 说 /v&nbsp; 的 /uj 确实 /ad 在 /p&nbsp; 理 /n </div>
</div>
<p>　</p>
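<p><font color="#ff0000">I am delighted that, just as this final article was completed, I received authorization from Professor Zhang Huaping (张华平). I will post the SharpICTCLAS source files as soon as I can so that everyone can test and use them.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</font></p>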
  <img src ="http://www.blogjava.net/jiangyz/aggbug/171316.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:38 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171316.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (7): OptimumSegment (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171314.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:34:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171314.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171314.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171314.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171314.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171314.html</trackback:ping><description><![CDATA[<p>The previous article showed that the NShortPath computation yields several candidate segmentations. How are these candidates reduced to a single final result? That is the job of OptimumSegment. The OptimumSegment process in SharpICTCLAS is essentially the same as in ICTCLAS, with no major changes.</p>
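<h3>1. How OptimumSegment Works</h3>
<p>The multiple results produced by NShortPath first go through the date-merging strategy, which is what the GenerateWord method mentioned in the previous article does. Inside GenerateWord you can see the following call:</p>
<p><code>&nbsp;m_graphOptimum.SetElement(pCur.row, pCur.col, ......);</code></p>
<p>Its job is to merge all of the candidate segmentations into a single <code> m_graphOptimum </code> property. For the NShortPath results below, for example, <code> m_graphOptimum </code> will contain every word shown in red once the merge is done.</p>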
<div class="code">
<div class="title">
<div style="float: right"><img class="copyCodeImage" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/copycode.gif" align="absMiddle" name="ccImage" /> <a onclick="CopyCode(this)" href="javascript:">Copy Code</a></div>
<div style="clear: none">Initial segmentation results after NShortPath</div>
</div>
<div class="content"><span style="color: #008000">//==== Original sentence: </span><br />
<br />
王晓平在1月份滦南大会上说的确实在理 <br />
<br />
<span style="color: #008000">//==== N results from the initial NShortPath segmentation: </span><br />
<br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大会, 上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大, 会上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大会, 上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月, 份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
<br />
<br />
<span style="color: #008000">//==== N results after applying number/date merging and related strategies: </span><br />
<br />
<font color="#ff0000"><strong>始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,</strong>&nbsp; </font><br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, <strong><font color="#ff0000">大会</font></strong>, 上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, <strong><font color="#ff0000">在理</font></strong>, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大会, 上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, <strong><font color="#ff0000">1月</font></strong>, <strong><font color="#ff0000">份</font></strong>, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
</div>
</div>
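<p>The merging step can be pictured as dropping every word of every candidate path into one collection keyed by its (row, col) span, so that words shared by several candidates are stored only once. Below is a minimal C# sketch of that idea. It is not the SharpICTCLAS implementation: <code>OptimumGraphSketch</code> and <code>Merge</code> are hypothetical names, a <code>SortedDictionary</code> stands in for the real graph structure, and the sketch assumes that every character is exactly one atom and that position 0 is occupied by the sentence-begin marker.</p>

```csharp
using System.Collections.Generic;

public static class OptimumGraphSketch
{
    // Merges several candidate segmentations into one set of unique (row, col) word spans.
    // Assumption: each character is one atom, so a word's width equals its length.
    public static SortedDictionary<(int Row, int Col), string> Merge(IEnumerable<string[]> paths)
    {
        var graph = new SortedDictionary<(int Row, int Col), string>();
        foreach (string[] path in paths)
        {
            int row = 1; // position 0 is reserved for the sentence-begin marker
            foreach (string word in path)
            {
                int col = row + word.Length;
                graph[(row, col)] = word; // a span seen in an earlier path is simply overwritten
                row = col;
            }
        }
        return graph;
    }
}
```

<p>Merging the two paths 南, 大, 会上 and 南, 大会, 上 this way gives five distinct spans; 南 occurs in both paths but is stored only once.</p>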
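<p>Next, person-name and place-name recognition is run on the merged <code> m_graphOptimum </code> to find every candidate person name and place name. After recognition, <code> m_graphOptimum </code> looks like this:</p>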
<div class="code">
<div class="title">
<div style="float: right"><img class="copyCodeImage" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/copycode.gif" align="absMiddle" name="ccImage" /> <a onclick="CopyCode(this)" href="javascript:">Copy Code</a></div>
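<div style="clear: none">After adding person-name and place-name recognition</div>
</div>
<div class="content"><span style="color: #008000">//==== After adding recognition of person names, transliterated names, and place names: </span><br />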
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 218.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.86,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.27,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 271.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 1234.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.00,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦 <br />
row:&nbsp; 8,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 20.37,&nbsp;&nbsp; nPOS: -28275,&nbsp;&nbsp; sWord:未##地 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 813.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp; 14536.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp; 1333.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 469.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp; 23706.00,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数 <br />
row: 13,&nbsp; col: 14,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row: 17,&nbsp; col: 18,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row: 18,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row: 19,&nbsp; col: 20,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
</div>
</div>
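<p>At this point <code> m_graphOptimum </code> contains every element that could appear in the final segmentation (person names, place names, and all word candidates that survived the NShortPath filtering). The Segment class runs NShortPath over this <code> m_graphOptimum </code> once more and takes the best path as the final segmentation.</p>
<p>The whole process can be seen in the Segment method of the WordSegment class, which SharpICTCLAS defines as follows (simplified):</p>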
<div class="code">
<div class="title">
<div style="float: right"><img class="copyCodeImage" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/copycode.gif" align="absMiddle" name="ccImage" /> <a onclick="CopyCode(this)" href="javascript:">Copy Code</a></div>
<div style="clear: none">Main segmentation routine</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> List&lt;WordResult[]&gt; Segment(<span style="color: #0000ff">string</span> sentence, <span style="color: #0000ff">int</span> nKind) <br />
{ <br />
&nbsp;&nbsp; m_pNewSentence = Predefine.SENTENCE_BEGIN + sentence + Predefine.SENTENCE_END; <br />
&nbsp;&nbsp; <span style="color: #008000">//--- Initial segmentation </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> nResultCount = m_Seg.BiSegment(m_pNewSentence, m_dSmoothingPara, nKind); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Person-name and place-name recognition </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> i = 0; i &lt; nResultCount; i++) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_uPerson.Recognition(m_Seg.m_pWordSeg[i], m_Seg.m_graphOptimum, ......); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_uTransPerson.Recognition(m_Seg.m_pWordSeg[i], m_Seg.m_graphOptimum, ......); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_uPlace.Recognition(m_Seg.m_pWordSeg[i], m_Seg.m_graphOptimum, ......); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Final optimization </span><br />
&nbsp;&nbsp; m_Seg.BiOptimumSegment(1, m_dSmoothingPara); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Part-of-speech tagging </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> i = 0; i &lt; m_Seg.m_pWordSeg.Count; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_POSTagger.POSTagging(m_Seg.m_pWordSeg[i], m_dictCore, m_dictCore); <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> m_Seg.m_pWordSeg; <br />
}</div>
</div>
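<p>The final optimization call, <code>BiOptimumSegment(1, ...)</code>, asks for only the single best path through the merged graph. As a rough illustration, the sketch below finds the cheapest path through a small word graph with one forward dynamic-programming pass. It is a simplification under stated assumptions: the real code works on a bigram graph and reuses NShortPath, whereas <code>WordEdge</code>, <code>BestPathSketch</code>, and the edge costs in the note that follows are invented for this example.</p>

```csharp
using System.Collections.Generic;

// A candidate word spanning atom positions Row..Col with a hypothetical cost.
public sealed class WordEdge
{
    public readonly int Row;
    public readonly int Col;
    public readonly double Cost;
    public readonly string Word;

    public WordEdge(int row, int col, double cost, string word)
    {
        Row = row; Col = col; Cost = cost; Word = word;
    }
}

public static class BestPathSketch
{
    // Returns the cheapest word sequence covering positions 0..end,
    // or an empty list if position end cannot be reached.
    // Assumes 0 <= Row < Col <= end for every edge.
    public static List<string> FindBest(IEnumerable<WordEdge> edges, int end)
    {
        var best = new double[end + 1];   // best[0] stays 0
        var via = new WordEdge[end + 1];  // edge used to reach each position
        for (int i = 1; i <= end; i++) best[i] = double.PositiveInfinity;

        // Every edge points forward, so relaxing edges in order of their
        // start position guarantees best[e.Row] is final when e is visited.
        var sorted = new List<WordEdge>(edges);
        sorted.Sort((a, b) => a.Row.CompareTo(b.Row));

        foreach (WordEdge e in sorted)
        {
            double cost = best[e.Row] + e.Cost;
            if (cost < best[e.Col])
            {
                best[e.Col] = cost;
                via[e.Col] = e;
            }
        }

        // Walk back from the end position to recover the words.
        var words = new List<string>();
        for (int pos = end; pos > 0 && via[pos] != null; pos = via[pos].Row)
            words.Add(via[pos].Word);
        words.Reverse();
        return words;
    }
}
```

<p>With the hypothetical edges 确实 (0 to 2, cost 1.0), 在 (2 to 3, cost 1.0), 理 (3 to 4, cost 2.0) and 在理 (2 to 4, cost 4.0), <code>FindBest(edges, 4)</code> returns 确实, 在, 理, because that path costs 4.0 in total against 5.0 for 确实, 在理.</p>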
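<h3>2. Recognizing Person Names and Place Names</h3>
<p>ICTCLAS recognizes person names by template matching: it first computes a tag for every word in each of the initial segmentation results, then matches the resulting tag string against a set of name patterns. The process is as follows:</p>
<p>First, the person-name matching templates are defined:</p>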
<div class="code">
<div class="title">
<div style="float: right"><img class="copyCodeImage" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/copycode.gif" align="absMiddle" name="ccImage" /> <a onclick="CopyCode(this)" href="javascript:">Copy Code</a></div>
<div style="clear: none">Person-name recognition templates</div>
</div>
<div class="content"><span style="color: #0000ff">string</span>[] sPatterns = { <span style="color: #ff00ff">"BBCD"</span>, <span style="color: #ff00ff">"BBC"</span>, <span style="color: #ff00ff">"BBE"</span>, <span style="color: #ff00ff">"BBZ"</span>, <span style="color: #ff00ff">"BCD"</span>, <br />
<span style="color: #ff00ff">"BEE"</span>, <span style="color: #ff00ff">"BE"</span>, <span style="color: #ff00ff">"BG"</span>, <span style="color: #ff00ff">"BXD"</span>, <span style="color: #ff00ff">"BZ"</span>, <span style="color: #ff00ff">"CDCD"</span>, <span style="color: #ff00ff">"CD"</span>, <span style="color: #ff00ff">"EE"</span>, <span style="color: #ff00ff">"FB"</span>,&nbsp; <br />
<span style="color: #ff00ff">"Y"</span>, <span style="color: #ff00ff">"XD"</span>, <span style="color: #ff00ff">""</span> }; <br />
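<span style="color: #008000">/*------------------------------------ <br />
The person recognition patterns set <br />
BBCD: surname + surname + given-name char 1 + given-name char 2; <br />
BBE: surname + surname + single-char given name; <br />
BBZ: surname + surname + two-char given name that forms a word; <br />
BCD: surname + given-name char 1 + given-name char 2; <br />
BE:&nbsp; surname + single-char given name; <br />
BEE: surname + single-char given name + single-char given name, e.g. 韩磊磊 <br />
BG:&nbsp; surname + suffix <br />
BXD: surname + (surname and first given-name char forming a word) + last given-name char <br />
BZ:&nbsp; surname + two-char given name that forms a word; <br />
B:&nbsp;&nbsp; surname <br />
CD:&nbsp; given-name char 1 + given-name char 2; <br />
EE:&nbsp; single-char given name + single-char given name; <br />
FB:&nbsp; prefix + surname <br />
XD:&nbsp; (surname and first given-name char forming a word) + last given-name char <br />
Y:&nbsp;&nbsp; surname and single-char given name forming a word <br />
------------------------------------*/</span></div>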
</div>
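<p>Next, the initial segmentation results are tagged, unneeded information is stripped out, and template matching is applied to extract the person names:</p>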
<div class="code">
<div class="title">
<div style="float: right"><img class="copyCodeImage" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/copycode.gif" align="absMiddle" name="ccImage" /> <a onclick="CopyCode(this)" href="javascript:">Copy Code</a></div>
<div style="clear: none">Person-name recognition process</div>
</div>
<div class="content"><span style="color: #008000">//==== One result set from the initial segmentation </span><br />
<br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
<br />
<span style="color: #008000">//==== The computed m_nBestTag </span><br />
<br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff0000">B&nbsp;&nbsp; C&nbsp;&nbsp; D</font>&nbsp;&nbsp; M&nbsp;&nbsp; A&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp;&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp;&nbsp;&nbsp; A&nbsp;&nbsp; A <br />
<br />
<span style="color: #008000">//==== Person name identified by template matching </span><br />
<br />
王晓平</div>
</div>
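<p>The matching step can be sketched as a left-to-right scan over the tag string in which the first pattern that matches at the current position wins. This is only an illustration under that assumption; the actual matching logic in ICTCLAS may differ. The pattern list is copied from the template array above (without its empty terminator), while <code>NamePatternSketch</code> and <code>FindNames</code> are hypothetical names, and the input is assumed to carry exactly one tag character per word.</p>

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class NamePatternSketch
{
    // Patterns in the order given by the template array.
    private static readonly string[] Patterns =
    {
        "BBCD", "BBC", "BBE", "BBZ", "BCD", "BEE", "BE", "BG",
        "BXD", "BZ", "CDCD", "CD", "EE", "FB", "Y", "XD"
    };

    // Scans the tag string and joins the words covered by each matched pattern.
    public static List<string> FindNames(IList<string> words, string tags)
    {
        if (words.Count != tags.Length)
            throw new ArgumentException("Expected exactly one tag per word.");

        var names = new List<string>();
        int i = 0;
        while (i < tags.Length)
        {
            // Find the first pattern that matches the tags starting at position i.
            int matched = 0;
            foreach (string pattern in Patterns)
            {
                if (i + pattern.Length <= tags.Length &&
                    tags.Substring(i, pattern.Length) == pattern)
                {
                    matched = pattern.Length;
                    break;
                }
            }

            if (matched == 0)
            {
                i++; // no name starts here
                continue;
            }

            var name = new StringBuilder();
            for (int k = i; k < i + matched; k++) name.Append(words[k]);
            names.Add(name.ToString());
            i += matched; // continue after the recognized name
        }
        return names;
    }
}
```

<p>For the fourteen words 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理 with the tag string <code>BCDMAAAAAAAAAA</code>, the only match is BCD at the start, so the sketch returns the single name 王晓平.</p>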
<p>Place-name recognition works in a similar way, so I will not go into it here. For more on person-name and place-name recognition, see <a href="http://qxred.yculblog.com/post.1204714.html">http://qxred.yculblog.com/post.1204714.html</a>, 《ICTCLAS 中科院分词系统 代码 注释 中文分词 词性标注》 by 风暴红QxRed.</p>
<p>　</p>
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
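<p>The multiple initial segmentations produced by NShortPath are merged into <code> m_graphOptimum </code>; person-name and place-name recognition then adds every candidate person name and place name to it; finally, the OptimumSegment method produces the final segmentation.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>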
 <img src ="http://www.blogjava.net/jiangyz/aggbug/171314.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:34 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171314.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (6): Segment (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171311.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:18:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171311.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171311.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171311.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171311.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171311.html</trackback:ping><description><![CDATA[<p>DynamicArray and NShortPath are foundational classes in ICTCLAS. Once that groundwork had been reworked, I set about porting and reworking the Segment word-segmentation code. The changes in SharpICTCLAS fall mainly into the following areas:</p>
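<p><font color="#800080"><strong>1) Merged code from different classes</strong></font></p>
<p>The original ICTCLAS uses two classes, SegGraph and Segment, to perform segmentation. SegGraph handles atom segmentation and builds segGraph, while Segment builds BiSegGraph and runs the NShortPath optimization. The final person-name and place-name recognition and the Optimum optimization are in turn scattered across the Segment and WordSegment classes.</p>
<p>SharpICTCLAS merges the original SegGraph and Segment classes into one, since what they do amounts to just a few steps of the segmentation process. WordSegment is left largely as it was, because it mostly handles peripheral work.</p>
<p><font color="#800080"><strong>2) Reworked some of the data structures used in the program</strong></font></p>
<p>The original ICTCLAS makes heavy use of arrays and two-dimensional arrays. Because of the inherent limitations of arrays, definitions like the following appear everywhere:</p>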
<p><code>m_pWordSeg = new PWORD_RESULT[MAX_SEGMENT_NUM];</code></p>
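<p>Because the number of words a sentence will produce is not known in advance, the array has to be allocated at the maximum capacity <code> MAX_SEGMENT_NUM </code>, so any abnormal input can cause an &#8220;overflow&#8221; error.</p>
<p>SharpICTCLAS instead records results largely as <code> List&lt;int[]&gt; </code>. A generic List lets the number of results grow dynamically without being fixed in advance, and each result array can have its own length.</p>
<p>Another change is that the Segment class uses a linked-list structure to process results, which avoids many of the problems caused by the array structures in the original ICTCLAS.</p>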
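<p>As a small illustration of the point about <code>List&lt;int[]&gt;</code>, the sketch below stores paths of different lengths without reserving any maximum capacity up front. <code>PathStoreSketch</code> and <code>Collect</code> are hypothetical names chosen for this example and do not appear in SharpICTCLAS.</p>

```csharp
using System.Collections.Generic;

public static class PathStoreSketch
{
    // Copies each incoming path into its own exactly-sized array.
    // The list grows as needed, so no maximum count has to be guessed.
    public static List<int[]> Collect(IEnumerable<IEnumerable<int>> rawPaths)
    {
        var result = new List<int[]>();
        foreach (IEnumerable<int> raw in rawPaths)
            result.Add(new List<int>(raw).ToArray());
        return result;
    }
}
```

<p>Passing in one path with three nodes and another with two yields a list of two arrays whose lengths are 3 and 2 respectively; neither array carries unused trailing slots.</p>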
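<p><font color="#800080"><strong>3) Extensive use of static methods</strong></font></p>
<p>Some routines do not need an object at all; they merely perform routine computation, so declaring them static is more appropriate, all the more so since static methods are cheaper to call than instance methods. When porting ICTCLAS to C#, I therefore made as many methods static as possible.</p>
<p>Below I describe the main parts of the Segment class in SharpICTCLAS:</p>
<h3>1. Main Body</h3>
<p>A fairly typical run can be seen in the BiSegment method; the (simplified) code is as follows:</p>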
<div class="code">
<div class="title">
<div style="float: right"><img class="copyCodeImage" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/copycode.gif" align="absMiddle" name="ccImage" /> <a onclick="CopyCode(this)" href="javascript:">Copy Code</a></div>
<div style="clear: none">The BiSegment method of the Segment class</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> BiSegment(<span style="color: #0000ff">string</span> sSentence, <span style="color: #0000ff">double</span> smoothPara, <span style="color: #0000ff">int</span> nKind) <br />
{ <br />
&nbsp;&nbsp; WordResult[] tmpResult; <br />
&nbsp;&nbsp; WordLinkedArray linkedArray; <br />
&nbsp;&nbsp; m_pWordSeg = <span style="color: #0000ff">new</span> List&lt;WordResult[]&gt;(); <br />
&nbsp;&nbsp; m_graphOptimum = <span style="color: #0000ff">new</span> RowFirstDynamicArray&lt;ChainContent&gt;(); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Atom segmentation </span><br />
&nbsp;&nbsp; <font color="#ff0000">atomSegment = AtomSegment(sSentence); </font><br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Look up the dictionary, add every possible word, and store them in a linked structure </span><br />
&nbsp;&nbsp; segGraph = GenerateWordNet(atomSegment, coreDict); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Find all possible pairwise combinations </span><br />
&nbsp;&nbsp; biGraphResult = BiGraphGenerate(segGraph, smoothPara, biDict, coreDict); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Use N-shortest-paths to compute multiple candidate segmentations </span><br />
&nbsp;&nbsp; NShortPath.Calculate(biGraphResult, nKind); <br />
&nbsp;&nbsp; List&lt;<span style="color: #0000ff">int</span>[]&gt; spResult = NShortPath.GetNPaths(Predefine.MAX_SEGMENT_NUM); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Post-process the results, e.g. merge dates </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> i = 0; i &lt; spResult.Count; i++) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff0000">linkedArray = BiPath2LinkedArray(spResult[i], segGraph, atomSegment); </font><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; tmpResult = GenerateWord(spResult[i], linkedArray, m_graphOptimum); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (tmpResult != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg.Add(tmpResult); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> m_pWordSeg.Count; <br />
}</div>
</div>
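<p>As the code above shows, the atomic segmentation functionality of the original ICTCLAS has been merged into the Segment class.</p>
<p>Take the sentence &#8220;<font color="#0000ff">他在1月份大会上说的确实在理</font>&#8221; (roughly, &#8220;what he said at the January meeting was indeed reasonable&#8221;) as an example. The steps above produce the following intermediate results:</p>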
<div class="code">
<div class="title">
<div style="clear: none">Program output</div>
</div>
<div class="content"><span style="color: #008000">//==== Original sentence: </span><br />
<br />
他在1月份大会上说的确实在理 <br />
<br />
<br />
<span style="color: #008000">//==== Atomic segmentation: </span><br />
<br />
始##始, 他, 在, 1, 月, 份, 大, 会, 上, 说, 的, 确, 实, 在, 理, 末##末, <br />
<br />
<br />
<span style="color: #008000">//==== Generated segGraph: </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp; 19823.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:他 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp; 1900.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:月 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp; 1234.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp; 14536.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 1333.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 6136.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 469.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp; 23706.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:上 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 210.00,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确 <br />
row: 11,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 181.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 357.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实 <br />
row: 12,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 295.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实在 <br />
row: 13,&nbsp; col: 14,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row: 13,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row: 15,&nbsp; col: 16,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
<br />
<br />
<span style="color: #008000">//==== Generated biSegGraph: </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.37,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始@他 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.37,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:他@在 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.74,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@未##数 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight: -27898.79,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@月 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 5,&nbsp; eWeight: -27898.75,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@月份 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.33,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:月@份 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.83,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份@大 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@大 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.83,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份@大会 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@大会 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会 <br />
row:&nbsp; 7,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会上 <br />
row:&nbsp; 8,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.11,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会@上 <br />
row:&nbsp; 9,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 8.16,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会@上 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.42,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上@说 <br />
row: 11,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.07,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:上@说 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.05,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的 <br />
row: 12,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.11,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的确 <br />
row: 13,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确 <br />
row: 13,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确实 <br />
row: 14,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.49,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确@实 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.63,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确@实 <br />
row: 14,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.49,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确@实在 <br />
row: 15,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.63,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确@实在 <br />
row: 16,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.92,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.98,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实@在 <br />
row: 16,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.97,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在理 <br />
row: 17,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.98,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实@在理 <br />
row: 18,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.17,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实在@理 <br />
row: 19,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.62,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@理 <br />
row: 20,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.30,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理@末##末 <br />
row: 21,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.95,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理@末##末 <br />
<br />
<br />
<span style="color: #008000">//==== The N preliminary segmentations produced by NShortPath: </span><br />
<br />
始##始, 他, 在, 1, 月份, 大会, 上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, 1, 月份, 大会, 上, 说, 的, 确实, 在理, 末##末, <br />
始##始, 他, 在, 1, 月份, 大, 会上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, 1, 月, 份, 大会, 上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, 1, 月份, 大, 会上, 说, 的, 确实, 在理, 末##末, <br />
<br />
<br />
<span style="color: #008000">//==== The N results after applying the number and date merging strategies: </span><br />
<br />
始##始, 他, 在, <font color="#ff0000">1月份</font>, 大会, 上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, <font color="#ff0000">1月份</font>, 大会, 上, 说, 的, 确实, 在理, 末##末, <br />
始##始, 他, 在, <font color="#ff0000">1月份</font>, 大, 会上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, <font color="#ff0000">1月</font>, 份, 大会, 上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, <font color="#ff0000">1月份</font>, 大, 会上, 说, 的, 确实, 在理, 末##末, <br />
</div>
</div>
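<p>These topics were covered in earlier articles. Here I focus on two parts of SharpICTCLAS: atomic segmentation, and the strategy for merging numbers and dates.</p>
<h3>2. Atomic segmentation</h3>
<p>Atomic segmentation looks like the simplest part of the program, since it merely separates the characters one by one, yet it is also the part most worth improving. SharpICTCLAS currently still uses the original ICTCLAS algorithm with minor adjustments. I am not very satisfied with this approach, though. Given the opportunity, one could use a series of regular expressions to pull certain &#8220;atomic&#8221; words out separately. For example, year designations such as &#8220;甲子&#8221; and &#8220;乙亥&#8221; are atomic information, and URLs, email addresses and the like could also be recognized as atoms in advance, which would greatly simplify the later stages. This is something to consider in the future.</p>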
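<p>The regular-expression idea could look roughly like the sketch below. This is not how SharpICTCLAS currently works: the class name, the <code>Split</code> method and the three patterns are assumptions for illustration, and the patterns are deliberately loose (the year pattern, for instance, also accepts stem and branch pairs that do not occur in the sexagenary cycle).</p>

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexAtomDemo
{
    // Hypothetical, deliberately loose patterns for illustration only.
    private static readonly Regex AtomPattern = new Regex(
        @"https?://[A-Za-z0-9./?=&%_\-]+" +                      // URL
        @"|[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}" +   // email address
        @"|[甲乙丙丁戊己庚辛壬癸][子丑寅卯辰巳午未申酉戌亥]");       // stem-branch year name

    // Splits a sentence into atoms: every regex match becomes a single atom,
    // every other character becomes an atom of its own.
    public static List<string> Split(string sentence)
    {
        var atoms = new List<string>();
        int pos = 0;
        foreach (Match m in AtomPattern.Matches(sentence))
        {
            // Characters before the match are emitted one by one.
            for (; pos < m.Index; pos++)
                atoms.Add(sentence[pos].ToString());
            atoms.Add(m.Value);
            pos = m.Index + m.Length;
        }
        // Remaining characters after the last match.
        for (; pos < sentence.Length; pos++)
            atoms.Add(sentence[pos].ToString());
        return atoms;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Split("甲子年访问http://a.cn吧")));
    }
}
```

<p>A real implementation would also have to keep the existing handling of digits and letters, and decide how such pre-recognized atoms are tagged for the later stages.</p>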
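<h3>3. Processing the results</h3>
<p>Both ICTCLAS and SharpICTCLAS compute shortest paths with NShortPath and output the results as arrays. The arrays only record segmentation positions, so some post-processing is still needed to turn them into actual &#8220;word segmentation&#8221; results.</p>
<p>The original ICTCLAS implementation is as follows:</p>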
<div class="code">
<div class="title">
<div style="clear: none">How ICTCLAS processes the NShortPath results</div>
</div>
<div class="content"><span style="color: #0000ff">while</span> (i &lt; m_nSegmentCount) <br />
{ <br />
&nbsp; BiPath2UniPath(nSegRoute[i]);&nbsp; <span style="color: #008000">//Convert the bi-path to a uni-path </span><br />
&nbsp; GenerateWord(nSegRoute, i);&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Generate words according to the segmentation route </span><br />
&nbsp; i++; <br />
}</div>
</div>
<p>What the BiPath2UniPath method does can be illustrated with the following example:</p>
<div class="code">
<div class="content">Converting a BiPath into a UniPath <br />
For example, &#8220;他说的确实在理&#8221; <br />
<br />
BiPath：（0, 1, 2, 3, 6, 9, 11, 12） <br />
&nbsp;&nbsp; 0&nbsp;&nbsp; 1&nbsp;&nbsp; 2&nbsp;&nbsp; 3&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp; 5&nbsp;&nbsp; 6&nbsp;&nbsp;&nbsp; 7&nbsp;&nbsp; 8&nbsp;&nbsp;&nbsp; 9&nbsp;&nbsp; 10&nbsp;&nbsp; 11&nbsp; 12 <br />
始##始&nbsp; 他&nbsp; 说&nbsp; 的&nbsp; 的确&nbsp; 确&nbsp; 确实&nbsp; 实&nbsp; 实在&nbsp; 在&nbsp; 在理&nbsp; 理&nbsp; 末##末 <br />
<br />
After conversion <br />
UniPath：（0, 1, 2, 3, 4, 6, 7, 8） <br />
&nbsp;&nbsp; 0&nbsp;&nbsp; 1&nbsp;&nbsp; 2&nbsp;&nbsp; 3&nbsp; 4&nbsp;&nbsp; 5&nbsp;&nbsp; 6&nbsp;&nbsp; 7&nbsp;&nbsp; 8 <br />
始##始&nbsp; 他&nbsp; 说&nbsp; 的&nbsp; 确&nbsp; 实&nbsp; 在&nbsp; 理&nbsp; 末##末 <br />
</div>
</div>
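<p>As this shows, the UniPath records the split positions relative to the atomic segmentation. The subsequent GenerateWord method then performs merging and optimization on this array.</p>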
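<p>The conversion in the example can be reproduced with a small sketch. It is a simplified reading of the example, not the ICTCLAS implementation (which relies on <code>m_npWordPosMapTable</code>): it assumes every candidate word is known by the atom positions where it starts and ends. The <code>Edge</code> type, the <code>ToUniPath</code> method and the Row/Col values listed for this sentence are reconstructed for illustration, not program output.</p>

```csharp
using System;
using System.Collections.Generic;

public static class BiPathDemo
{
    // Illustrative type: a candidate word spans atoms [Row, Col).
    public struct Edge
    {
        public int Row, Col;
        public Edge(int row, int col) { Row = row; Col = col; }
    }

    // Maps each word index on the BiPath to the atom position where that word starts.
    public static int[] ToUniPath(int[] biPath, IList<Edge> words)
    {
        var uniPath = new int[biPath.Length];
        for (int i = 0; i < biPath.Length; i++)
            uniPath[i] = words[biPath[i]].Row;
        return uniPath;
    }

    // Candidate words for the example sentence, in the order shown in the example
    // (Row/Col values are assumptions derived from the atom positions).
    public static List<Edge> SampleWords()
    {
        return new List<Edge>
        {
            new Edge(0, 1), // 0  始##始
            new Edge(1, 2), // 1  他
            new Edge(2, 3), // 2  说
            new Edge(3, 4), // 3  的
            new Edge(3, 5), // 4  的确
            new Edge(4, 5), // 5  确
            new Edge(4, 6), // 6  确实
            new Edge(5, 6), // 7  实
            new Edge(5, 7), // 8  实在
            new Edge(6, 7), // 9  在
            new Edge(6, 8), // 10 在理
            new Edge(7, 8), // 11 理
            new Edge(8, 9)  // 12 末##末
        };
    }

    public static void Main()
    {
        int[] biPath = { 0, 1, 2, 3, 6, 9, 11, 12 };
        Console.WriteLine(string.Join(", ", ToUniPath(biPath, SampleWords())));
    }
}
```

<p>Under these assumptions the BiPath from the example maps to the atom positions 0, 1, 2, 3, 4, 6, 7, 8, which matches the UniPath listed there.</p>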
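<p>While reworking this part for SharpICTCLAS, I found that the array representation made the subsequent work very difficult (consider which is simpler: merging two adjacent nodes in a linked list, or two adjacent elements in an array?). I therefore decided to convert the BiPath into a linked-list structure for later use in SharpICTCLAS, and in practice this simplified a good deal of work.</p>
<p>This is reflected in the BiSegment method, as follows:</p>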
<p><code>linkedArray = BiPath2LinkedArray(spResult[i], segGraph, atomSegment); </code></p>
<p>With this change, the <code> int *m_npWordPosMapTable; </code> member of the original ICTCLAS is no longer needed, and the code related to it can be removed as well.</p>
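<p>To illustrate why merging adjacent words is simpler on a linked list, here is a minimal sketch. It uses the standard <code>LinkedList&lt;string&gt;</code> as a stand-in; SharpICTCLAS has its own WordLinkedArray type, whose members are not shown in this article, so the helper name <code>MergeWithNext</code> and the plain string payload are assumptions.</p>

```csharp
using System;
using System.Collections.Generic;

public static class LinkedMergeDemo
{
    // Merges a node with its successor: concatenate the text, unlink the successor.
    // No elements have to be shifted and no index table has to be updated.
    public static void MergeWithNext(LinkedList<string> words, LinkedListNode<string> node)
    {
        LinkedListNode<string> next = node.Next;
        if (next == null)
            return; // nothing to merge with
        node.Value += next.Value;
        words.Remove(next);
    }

    public static void Main()
    {
        var words = new LinkedList<string>(new[] { "他", "在", "1", "月份", "大会" });
        MergeWithNext(words, words.Find("1"));
        Console.WriteLine(string.Join(", ", words));
    }
}
```

<p>With an array, the same merge means shifting every later element one slot to the left and keeping any position table consistent with the new layout.</p>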
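<h3>4. Date and number merging strategy</h3>
<p>The merging and splitting strategies for numbers, dates and the like are implemented in the GenerateWord method. In the original ICTCLAS this is an enormous method with at least six or seven levels of nested if and while statements, which makes analysing its internal behaviour extremely laborious. After some study, I extracted the main functional parts and processed them in a &#8220;pipeline&#8221; fashion, reducing code complexity. For the date and time recognition logic, parts of which are exceptionally convoluted, SharpICTCLAS still retains most of the original content.</p>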
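<p>The &#8220;pipeline&#8221; idea can be sketched in a few lines. This is a heavily simplified illustration rather than the SharpICTCLAS code: it works on plain strings in a <code>LinkedList&lt;string&gt;</code>, ignores part-of-speech tags, and the names <code>MergeNumbers</code>, <code>AttachTimeSuffix</code> and <code>Run</code> are made up. Only the suffix list (月, 日, 时, 分, 秒, 月份) is taken from the ICTCLAS code quoted in this article.</p>

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PipelineDemo
{
    private static readonly HashSet<string> TimeSuffixes =
        new HashSet<string> { "月", "日", "时", "分", "秒", "月份" };

    private static bool IsNumber(string s)
    {
        return s.Length > 0 && s.All(char.IsDigit);
    }

    // Stage 1: merge runs of adjacent numeric tokens into one token.
    public static void MergeNumbers(LinkedList<string> words)
    {
        LinkedListNode<string> node = words.First;
        while (node != null && node.Next != null)
        {
            if (IsNumber(node.Value) && IsNumber(node.Next.Value))
            {
                node.Value += node.Next.Value;
                words.Remove(node.Next); // stay on this node, the run may continue
            }
            else
            {
                node = node.Next;
            }
        }
    }

    // Stage 2: append a time suffix to the number directly in front of it.
    public static void AttachTimeSuffix(LinkedList<string> words)
    {
        LinkedListNode<string> node = words.First;
        while (node != null && node.Next != null)
        {
            if (IsNumber(node.Value) && TimeSuffixes.Contains(node.Next.Value))
            {
                node.Value += node.Next.Value;
                words.Remove(node.Next);
            }
            node = node.Next;
        }
    }

    // Each stage does one job and hands the same list on to the next stage.
    public static LinkedList<string> Run(IEnumerable<string> tokens)
    {
        var words = new LinkedList<string>(tokens);
        MergeNumbers(words);
        AttachTimeSuffix(words);
        return words;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run(new[] { "他", "在", "1", "2", "月份", "大会" })));
    }
}
```

<p>Because each stage only sees the list left behind by the previous one, a stage can be added, removed or reordered without touching the others, which is the main gain over one deeply nested method.</p>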
<p>Let us first look at the GenerateWord method of the original ICTCLAS (a very long method):</p>
<div class="code">
<div class="title">
<div style="clear: none">The GenerateWord method in ICTCLAS</div>
</div>
<div class="content"><span style="color: #008000">//Generate words according to the segmentation route </span><br />
<span style="color: #0000ff">bool</span> CSegment::GenerateWord(<span style="color: #0000ff">int</span> **nSegRoute, <span style="color: #0000ff">int</span> nIndex) <br />
{ <br />
&nbsp; unsigned <span style="color: #0000ff">int</span> i = 0, k = 0; <br />
&nbsp; <span style="color: #0000ff">int</span> j, nStartVertex, nEndVertex, nPOS; <br />
&nbsp; <span style="color: #0000ff">char</span> sAtom[WORD_MAXLENGTH], sNumCandidate[100], sCurWord[100]; <br />
&nbsp; ELEMENT_TYPE fValue; <br />
&nbsp; <span style="color: #0000ff">while</span> (nSegRoute[nIndex][i] !=&nbsp; - 1 &amp;&amp; nSegRoute[nIndex][i + 1] !=&nbsp; - 1 &amp;&amp; <br />
&nbsp;&nbsp;&nbsp; nSegRoute[nIndex][i] &lt; nSegRoute[nIndex][i + 1]) <br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; nStartVertex = nSegRoute[nIndex][i]; <br />
&nbsp;&nbsp;&nbsp; j = nStartVertex; <span style="color: #008000">//Set the start vertex </span><br />
&nbsp;&nbsp;&nbsp; nEndVertex = nSegRoute[nIndex][i + 1]; <span style="color: #008000">//Set the end vertex </span><br />
&nbsp;&nbsp;&nbsp; nPOS = 0; <br />
&nbsp;&nbsp;&nbsp; m_graphSeg.m_segGraph.GetElement(nStartVertex, nEndVertex, &amp;fValue, &amp;nPOS); <br />
&nbsp;&nbsp;&nbsp; sAtom[0] = 0; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (j &lt; nEndVertex) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Generate the word according to the segmentation route </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(sAtom, m_graphSeg.m_sAtom[j]); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; j++; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[0] = 0; <span style="color: #008000">//Init the result ending </span><br />
&nbsp;&nbsp;&nbsp; strcpy(sNumCandidate, sAtom); <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (sAtom[0] != 0 &amp;&amp; (IsAllNum((unsigned <span style="color: #0000ff">char</span>*)sNumCandidate) || <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; IsAllChineseNum(sNumCandidate))) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Merge all separate consecutive numbers into one number </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//sAtom[0]!=0: add in 2002-5-9 </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(m_pWordSeg[nIndex][k].sWord, sNumCandidate); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Save them in the result segmentation </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i++; <span style="color: #008000">//Skip to next atom now&nbsp; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sAtom[0] = 0; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (j &lt; nSegRoute[nIndex][i + 1]) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Generate the word according to the segmentation route </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(sAtom, m_graphSeg.m_sAtom[j]); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; j++; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(sNumCandidate, sAtom); <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; unsigned <span style="color: #0000ff">int</span> nLen = strlen(m_pWordSeg[nIndex][k].sWord); <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nLen == 4 &amp;&amp; CC_Find(<span style="color: #ff00ff">"第上成&#177;—＋∶&#183;．／"</span>, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord) || nLen == 1 &amp;&amp; strchr(<span style="color: #ff00ff">"+-./"</span>, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[0])) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Only one word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, m_pWordSeg[nIndex][k].sWord); <span style="color: #008000">//Record current word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <span style="color: #0000ff">if</span> (m_pWordSeg[nIndex][k].sWord[0] == 0) <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The while loop above was never entered </span><br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(m_pWordSeg[nIndex][k].sWord, sAtom); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Save them in the result segmentation </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, sAtom); <span style="color: #008000">//Record current word </span><br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//It is a num </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (strcmp(<span style="color: #ff00ff">"－－"</span>, m_pWordSeg[nIndex][k].sWord) == 0 || strcmp(<span style="color: #ff00ff">"—"</span>, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord) == 0 || m_pWordSeg[nIndex][k].sWord[0] == <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '-' &amp;&amp; m_pWordSeg[nIndex][k].sWord[1] == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The delimiter "－－" </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS = 30464; <span style="color: #008000">//'w'*256;Set the POS with 'w' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <span style="color: #008000">//Not num, back to previous word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Adding time suffix </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">char</span> sInitChar[3]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; unsigned <span style="color: #0000ff">int</span> nCharIndex = 0; <span style="color: #008000">//Get first char </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sInitChar[nCharIndex] = m_pWordSeg[nIndex][k].sWord[nCharIndex]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (sInitChar[nCharIndex] &lt; 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCharIndex += 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sInitChar[nCharIndex] = m_pWordSeg[nIndex][k].sWord[nCharIndex]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCharIndex += 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sInitChar[nCharIndex] = '\0'; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (k &gt; 0 &amp;&amp; (abs(m_pWordSeg[nIndex][k - 1].nHandle) == 27904 || abs <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (m_pWordSeg[nIndex][k - 1].nHandle) == 29696) &amp;&amp; (strcmp(sInitChar,&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #ff00ff">"—"</span>) == 0 || sInitChar[0] == '-') &amp;&amp; (strlen <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (m_pWordSeg[nIndex][k].sWord) &gt; nCharIndex)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//3-4月&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //27904='m'*256 </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Split the sInitChar from the original word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(m_pWordSeg[nIndex][k + 1].sWord, m_pWordSeg[nIndex][k].sWord + <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCharIndex); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k + 1].dValue = m_pWordSeg[nIndex][k].dValue; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k + 1].nHandle = 27904; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nCharIndex] = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].dValue = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].nHandle = 30464; <span style="color: #008000">//'w'*256; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_graphOptimum.SetElement(nStartVertex, nStartVertex + 1, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].dValue, m_pWordSeg[nIndex][k].nHandle, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nStartVertex += 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; k += 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nLen = strlen(m_pWordSeg[nIndex][k].sWord); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> ((strlen(sAtom) == 2 &amp;&amp; CC_Find(<span style="color: #ff00ff">"月日时分秒"</span>, sAtom)) || strcmp <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (sAtom, <span style="color: #ff00ff">"月份"</span>) == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//2001年 </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(m_pWordSeg[nIndex][k].sWord, sAtom); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##时"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 29696; <span style="color: #008000">//'t'*256; set the POS to 't' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <span style="color: #0000ff">if</span> (strcmp(sAtom, <span style="color: #ff00ff">"年"</span>) == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (IsYearTime(m_pWordSeg[nIndex][k].sWord)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//strncmp(sAtom,"年",2)==0&amp;&amp; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//1998年， </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(m_pWordSeg[nIndex][k].sWord, sAtom); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##时"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 29696; <span style="color: #008000">//Set the POS with 't' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##数"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 27904; <span style="color: #008000">//Set the POS with 'm' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <span style="color: #008000">//Can not be a time word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//早晨/t&nbsp; 五点/t&nbsp; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (strcmp(m_pWordSeg[nIndex][k].sWord + strlen <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (m_pWordSeg[nIndex][k].sWord) - 2, <span style="color: #ff00ff">"点"</span>) == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##时"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 29696; <span style="color: #008000">//Set the POS with 't' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (!CC_Find(<span style="color: #ff00ff">"∶&#183;．／"</span>, m_pWordSeg[nIndex][k].sWord + nLen - 2) &amp;&amp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 1] != '.' &amp;&amp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 1] != '/') <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##数"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 27904; <span style="color: #008000">//'m'*256;Set the POS with 'm' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <span style="color: #0000ff">if</span> (nLen &gt; strlen(sInitChar)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Get rid of . example 1. </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (m_pWordSeg[nIndex][k].sWord[nLen - 1] == '.' || <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 1] == '/') <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 1] = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 2] = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##数"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 27904; <span style="color: #008000">//'m'*256;Set the POS with 'm' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <span style="color: #008000">//Not num, back to previous word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fValue = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nEndVertex = nSegRoute[nIndex][i + 1]; <span style="color: #008000">//End position moved to the later vertex </span><br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].nHandle = nPOS; <span style="color: #008000">//Get the POS of current word </span><br />
&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].dValue = fValue;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//(int)(MAX_FREQUENCE*exp(-fValue));//Return the frequency of current word </span><br />
&nbsp;&nbsp;&nbsp; m_graphOptimum.SetElement(nStartVertex, nEndVertex, fValue, nPOS, sCurWord); <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Generate the optimum segmentation graph according to the segmentation result </span><br />
&nbsp;&nbsp;&nbsp; i++; <span style="color: #008000">//Skip to next atom </span><br />
&nbsp;&nbsp;&nbsp; k++; <span style="color: #008000">//Accept next word </span><br />
&nbsp; } <br />
&nbsp; m_pWordSeg[nIndex][k].sWord[0] = 0; <br />
&nbsp; m_pWordSeg[nIndex][k].nHandle =&nbsp; - 1; <span style="color: #008000">//Set ending </span><br />
&nbsp; <span style="color: #0000ff">return</span> <span style="color: #0000ff">true</span>; <br />
}</div>
</div>
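<p>In SharpICTCLAS, this very long code is split up by function and replaced with an &#8220;assembly-line&#8221; processing flow: separate stages handle separate tasks and pass the result along from one to the next (much like the Chain of Responsibility design pattern), which makes the overall structure much clearer. The GenerateWord method in SharpICTCLAS is defined as follows:</p>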
<div class="code">
<div class="title">
<div style="clear: none">The GenerateWord method in SharpICTCLAS</div>
</div>
<div class="content"><span style="color: #0000ff">private</span> <span style="color: #0000ff">static</span> WordResult[] GenerateWord(<span style="color: #0000ff">int</span>[] uniPath, WordLinkedArray linkedArray,&nbsp; <br />
&nbsp;&nbsp; RowFirstDynamicArray&lt;ChainContent&gt; m_graphOptimum) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">if</span> (linkedArray.Count == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> <span style="color: #0000ff">null</span>; <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//Merge all separate consecutive numbers into one number </span><br />
&nbsp;&nbsp; MergeContinueNumIntoOne(<span style="color: #0000ff">ref</span> linkedArray); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//The delimiter "－－" </span><br />
&nbsp;&nbsp; ChangeDelimiterPOS(<span style="color: #0000ff">ref</span> linkedArray); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//If the previous word is a number and the current word starts with &#8220;－&#8221; or &#8220;-&#8221; and is longer than that one character, </span><br />
&nbsp;&nbsp; <span style="color: #008000">//split the &#8220;－&#8221; off from the current word. </span><br />
&nbsp;&nbsp; <span style="color: #008000">//For example, &#8220;3 / -4 / 月&#8221; has to be split into &#8220;3 / - / 4 / 月&#8221; </span><br />
&nbsp;&nbsp; SplitMiddleSlashFromDigitalWords(<span style="color: #0000ff">ref</span> linkedArray); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
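&nbsp;&nbsp; <span style="color: #008000">//1. If the current word is a number and the next word is one of &#8220;月, 日, 时, 分, 秒, 月份&#8221; (month, day, hour, minute, second, month of the year), merge them and tag the result as time </span><br />
&nbsp;&nbsp; <span style="color: #008000">//2. If the current word is a number that can be a year and the next word is &#8220;年&#8221; (year), merge them and tag the result as time; otherwise tag it as a number. </span><br />
&nbsp;&nbsp; <span style="color: #008000">//3. If the last character is "点" (o'clock), treat the current number as time </span><br />
&nbsp;&nbsp; <span style="color: #008000">//4. If the last character of the current string is not one of "∶&#183;．／" or the half-width '.' '/', it is a number </span><br />
&nbsp;&nbsp; <span style="color: #008000">//5. If the last character of the current string is one of "∶&#183;．／" or the half-width '.' '/' and the length is greater than 1, drop the last character, e.g. "1." </span><br />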
&nbsp;&nbsp; <font color="#ff0000">CheckDateElements</font>(<span style="color: #0000ff">ref</span> linkedArray); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//Traverse the linked list and output the result </span><br />
&nbsp;&nbsp; WordResult[] result = <span style="color: #0000ff">new</span> WordResult[linkedArray.Count]; <br />
<br />
&nbsp;&nbsp; WordNode pCur = linkedArray.first; <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> i = 0; <br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WordResult item = <span style="color: #0000ff">new</span> WordResult(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; item.sWord = pCur.theWord.sWord; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; item.nPOS = pCur.theWord.nPOS; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; item.dValue = pCur.theWord.dValue; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; result[i] = item; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_graphOptimum.SetElement(pCur.row, pCur.col, <span style="color: #0000ff">new</span> ChainContent(item.dValue, item.nPOS, pCur.sWordInSegGraph)); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i++; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> result; <br />
}</div>
</div>
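<p>As the code shows, linkedArray is the baton that is handed from one processing stage to the next and refined along the way until the final result is produced. The CheckDateElements method, however, still has to deal with too many cases, so its implementation remains somewhat bloated; it could be split up further in the future.</p>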
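<p>The hand-over between stages can be illustrated with a minimal, self-contained sketch. It is not SharpICTCLAS code: the PipelineSketch class, its two stages (MergeDigits and MergeMonth) and the plain LinkedList&lt;string&gt; that stands in for WordLinkedArray are assumptions made purely for illustration, and the merging rules are far simpler than the real ones.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: a two-stage pipeline over a linked list (hypothetical names)</div>
</div>
<div class="content">using System; <br />
using System.Collections.Generic; <br />
<br />
static class PipelineSketch <br />
{ <br />
&nbsp;&nbsp; // True when the word consists of decimal digits only. <br />
&nbsp;&nbsp; static bool IsDigits(string s) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (s.Length == 0) return false; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (char c in s) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (!char.IsDigit(c)) return false; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return true; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; // Stage 1 (hypothetical): merge adjacent all-digit words into one number. <br />
&nbsp;&nbsp; static void MergeDigits(LinkedList&lt;string&gt; words) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LinkedListNode&lt;string&gt; node = words.First; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; while (node != null &amp;&amp; node.Next != null) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (IsDigits(node.Value) &amp;&amp; IsDigits(node.Next.Value)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; node.Value += node.Next.Value; // stay on this node: more digits may follow <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; words.Remove(node.Next); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; node = node.Next; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; // Stage 2 (hypothetical): merge a number with a following "月" (month). <br />
&nbsp;&nbsp; static void MergeMonth(LinkedList&lt;string&gt; words) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for (LinkedListNode&lt;string&gt; node = words.First; node != null &amp;&amp; node.Next != null; node = node.Next) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (IsDigits(node.Value) &amp;&amp; node.Next.Value == "月") <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; node.Value += "月"; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; words.Remove(node.Next); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; // Runs every stage in order on the same list, then copies the list to an array. <br />
&nbsp;&nbsp; public static string[] Run(params string[] input) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LinkedList&lt;string&gt; words = new LinkedList&lt;string&gt;(input); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Action&lt;LinkedList&lt;string&gt;&gt;[] stages = { MergeDigits, MergeMonth }; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (Action&lt;LinkedList&lt;string&gt;&gt; stage in stages) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; stage(words); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; string[] result = new string[words.Count]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; words.CopyTo(result, 0); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return result; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(string.Join(" / ", Run("2", "0", "0", "7", "年", "1", "2", "月"))); <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
<p>For the input above, the first stage collapses the digits into &#8220;2007&#8221; and &#8220;12&#8221;, and the second stage turns &#8220;12&#8221; and &#8220;月&#8221; into &#8220;12月&#8221;, so Run returns the three words &#8220;2007&#8221;, &#8220;年&#8221; and &#8220;12月&#8221;. Adding a new rule in this arrangement means adding one more stage to the array rather than editing one long function.</p>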
<p>　</p>
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
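<p>1) The Segment class is the largest class in SharpICTCLAS and implements several key steps of the segmentation process.</p>
<p>2) The Segment class changes much of the original ICTCLAS code, aiming to simplify the original operations with new data structures.</p>
<p>3) Segment defines some of its methods as static to make them cheaper to call.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>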
 <img src ="http://www.blogjava.net/jiangyz/aggbug/171311.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:18 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171311.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (5): NShortPath-2 (repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171309.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:05:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171309.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171309.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171309.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171309.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171309.html</trackback:ping><description><![CDATA[<p>Now that we have seen how the 1-shortest path is computed, let us look at the N-shortest path.</p>
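<p>The N-shortest path is computed in essentially the same way as the 1-shortest path, except that when the reachable paths are recorded, the N shortest results are kept. We reuse the example from the previous article to see how the N-shortest path computation works.</p>
<h3>1. Data representation</h3>
<p>We continue with the example from the previous article and compute the N-shortest paths of the graph below; the weight of each edge is marked in the figure.</p>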
<p><img height="107" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308002.gif" width="383" border="0" /></p>
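<p>(Figure 1)</p>
<h3>2. Computation</h3>
<p>As with the 1-shortest path, we compute for every node the PreNodes through which the N shortest paths reach it. We use the 2-shortest path as the example:</p>
<p>1) First compute, for every node, the possible lengths of all paths that reach it, and sort them in ascending order.</p>
<p>2) Take the two smallest path lengths from the sorted result and record them in the PreNode queues of each node, as shown below:</p>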
<p><img height="278" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308008.gif" width="456" border="0" /></p>
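<p>(Figure 2)</p>
<p>In this figure, nodes 1, 2 and 3 may be reached by more than one path, but each of them can be reached with only one path length. Node 4 (&#8220;D&#8221;), however, can be reached with two different path lengths, 3 or 4. The PreNode for length 3 is therefore recorded in the &#8220;shortest&#8221; slot (index = 0) and the PreNode for length 4 in the &#8220;second-shortest&#8221; slot (index = 1), and so on.</p>
<p>Note that the coordinate that identifies a PreNode has grown from the single number used for the 1-shortest path (<font color="#0000ff">the ParentNode value</font>) to two numbers (<font color="#0000ff">the ParentNode value and the index value</font>).</p>
<p>As Figure 2 shows, the second-shortest path into node 6 (&#8220;末&#8221;, the end node) has two ParentNodes: node 4 at index = 0 and node 5 at index = 1. Both give a total path length of 6.</p>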
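<p>The bookkeeping described above can be reproduced with a short sketch. It is not the SharpICTCLAS implementation: the NShortSketch class and its Solve method are hypothetical, the edge list is copied from the example graph, and the code assumes that every edge runs from a lower-numbered node to a higher-numbered one so that the nodes can be processed in numeric order. For each node it keeps the N smallest distinct path lengths together with their (ParentNode, index) pairs.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: path lengths and (ParentNode, index) pairs for the example graph (hypothetical names)</div>
</div>
<div class="content">using System; <br />
using System.Collections.Generic; <br />
<br />
static class NShortSketch <br />
{ <br />
&nbsp;&nbsp; // Edges { from, to, weight } of the example graph; node 0 is the start, node 6 the end. <br />
&nbsp;&nbsp; static readonly int[][] Edges = <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 0, 1, 1 }, new[] { 1, 2, 1 }, new[] { 1, 3, 2 }, new[] { 2, 3, 1 }, new[] { 2, 4, 1 }, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 3, 4, 1 }, new[] { 4, 5, 1 }, new[] { 3, 6, 2 }, new[] { 4, 6, 3 }, new[] { 5, 6, 1 } <br />
&nbsp;&nbsp; }; <br />
<br />
&nbsp;&nbsp; // For every node: the n smallest distinct path lengths from node 0, <br />
&nbsp;&nbsp; // each mapped to the { parentNode, parentIndex } pairs that produce it. <br />
&nbsp;&nbsp; public static SortedDictionary&lt;int, List&lt;int[]&gt;&gt;[] Solve(int nodeCount, int n) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; var table = new SortedDictionary&lt;int, List&lt;int[]&gt;&gt;[nodeCount]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; table[0] = new SortedDictionary&lt;int, List&lt;int[]&gt;&gt; { { 0, new List&lt;int[]&gt;() } }; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for (int cur = 1; cur &lt; nodeCount; cur++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // Collect every reachable length, sorted in ascending order by the dictionary. <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; var all = new SortedDictionary&lt;int, List&lt;int[]&gt;&gt;(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (int[] e in Edges) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (e[1] != cur) continue; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int index = 0; // position of the parent's length among its kept lengths <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (int len in table[e[0]].Keys) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int total = len + e[2]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (!all.ContainsKey(total)) all[total] = new List&lt;int[]&gt;(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; all[total].Add(new[] { e[0], index++ }); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // Keep only the n smallest lengths. <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; table[cur] = new SortedDictionary&lt;int, List&lt;int[]&gt;&gt;(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (KeyValuePair&lt;int, List&lt;int[]&gt;&gt; kv in all) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (table[cur].Count == n) break; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; table[cur].Add(kv.Key, kv.Value); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return table; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (KeyValuePair&lt;int, List&lt;int[]&gt;&gt; kv in Solve(7, 2)[6]) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; List&lt;string&gt; pairs = new List&lt;string&gt;(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (int[] p in kv.Value) pairs.Add("(" + p[0] + "," + p[1] + ")"); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine("length " + kv.Key + ": " + string.Join(" ", pairs)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
<p>For node 6 with N = 2 the sketch prints &#8220;length 5: (3,0) (5,0)&#8221; and &#8220;length 6: (4,0) (5,1)&#8221;, which matches the two ParentNodes of the second-shortest path described above. A sorted dictionary is used here only for brevity; SharpICTCLAS does the sorting with its CQueue class.</p>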
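<h3>3. Implementation</h3>
<p>To implement the algorithm above, we first need the lengths of all possible paths. SharpICTCLAS computes them in the EnQueueCurNodeEdges method. The previous article showed a simplified version; here is the complete version used for the N-shortest path:</p>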
<div class="code">
<div class="title">
<div style="clear: none">Program</div>
</div>
<div class="content"><span style="color: #008000">//==================================================================== </span><br />
<span style="color: #008000">// Enqueue all possible edges into the current node (nCurNode), sorted by eWeight </span><br />
<span style="color: #008000">//==================================================================== </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">static</span> <span style="color: #0000ff">void</span> EnQueueCurNodeEdges(<span style="color: #0000ff">ref</span> CQueue queWork, <span style="color: #0000ff">int</span> nCurNode) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> nPreNode; <br />
&nbsp;&nbsp; <span style="color: #0000ff">double</span> eWeight; <br />
&nbsp;&nbsp; ChainItem&lt;ChainContent&gt; pEdgeList; <br />
<br />
&nbsp;&nbsp; queWork.Clear(); <br />
&nbsp;&nbsp; pEdgeList = m_apCost.GetFirstElementOfCol(nCurNode); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">// Get all the edges </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pEdgeList != <span style="color: #0000ff">null</span> &amp;&amp; pEdgeList.col == nCurNode) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPreNode = pEdgeList.row;&nbsp; <span style="color: #008000">// </span><font color="#ff0000">A peculiar statement: it exploits the relation between row and col</font><span style="color: #008000"> </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; eWeight = pEdgeList.Content.eWeight; <span style="color: #008000">//Get the eWeight of edges </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> i = 0; i &lt; <font color="#ff0000">m_nValueKind</font>; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// First node: it has no PreNode, so enqueue it directly </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nPreNode == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queWork.EnQueue(<span style="color: #0000ff">new</span> QueueElement(nPreNode, i, eWeight)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// If the weight of the PreNode equals Predefine.INFINITE_VALUE, there is no need to go on </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (m_pWeight[nPreNode - 1][i] == Predefine.INFINITE_VALUE) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queWork.EnQueue(<span style="color: #0000ff">new</span> QueueElement(nPreNode, i, eWeight + m_pWeight[nPreNode - 1][i])); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pEdgeList = pEdgeList.next; <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
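<p>Here m_nValueKind is the number of distinct path lengths that the N-shortest path computation should keep.</p>
<p>With m_nValueKind = 2 we obtain the 2-shortest paths. There are two path lengths, 5 and 6, and six paths in total:</p>
<p>Shortest paths:</p>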
<ul>
    <li><font color="#0000ff">0, 1, 3, 6</font> </li>
    <li><font color="#0000ff">0, 1, 2, 3, 6</font> </li>
    <li><font color="#0000ff">0, 1, 2, 4, 5, 6</font> </li>
</ul>
<p>========================</p>
<p>Second-shortest paths:</p>
<ul>
    <li><font color="#0000ff">0, 1, 2, 4, 6</font> </li>
    <li><font color="#0000ff">0, 1, 3, 4, 5, 6</font> </li>
    <li><font color="#0000ff">0, 1, 2, 3, 4, 5, 6</font> </li>
</ul>
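<h3>4. Extracting the N-shortest paths</h3>
<p>The final output of the N-shortest paths works exactly as in the previous article and again relies on a stack. The only difference is that the push and pop passes are carried out once for each value of index. The details are not repeated here; interested readers can refer to the previous article.</p>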
<p>　</p>
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
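<p>1) For the N-shortest path, the coordinate that identifies a PreNode grows from the single number used for the 1-shortest path (<font color="#0000ff">the ParentNode value</font>) to two numbers (<font color="#0000ff">the ParentNode value and the index value</font>).</p>
<p>2) N-shortest path does not mean that only N paths are found.</p>
<p>3) This article only demonstrates the 2-shortest path, but the approach generalizes to any N. Among the 3-shortest paths found by the program, the longest ones are (0, 1, 3, 4, 6) and (0, 1, 2, 3, 4, 6), both of length 7.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>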
<img src ="http://www.blogjava.net/jiangyz/aggbug/171309.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:05 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171309.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (4): NShortPath-1 (repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171299.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 11:38:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171299.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171299.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171299.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171299.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171299.html</trackback:ping><description><![CDATA[<p>Rough segmentation of Chinese words with the N-shortest path method is a very important step of the segmentation process, and the corresponding code in the original ICTCLAS is, in my view, also the hardest part to read. There are still a few methods I have not figured out, so I rewrote the NShortPath class almost from scratch. Explaining how the N-shortest path code works is not easy, so the explanation is split into two parts: this part describes how SharpICTCLAS implements the 1-shortest path, and the next article extends it to the N-shortest path.</p>
<h3>1. Data representation</h3>
<p>The shortest-path example uses the following directed graph; the weight of each edge is marked in the figure.</p>
<p><img height="107" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308002.gif" width="383" border="0" /></p>
<p>(Figure 1)</p>
<p>As described in the previous article, this graph is equivalent to the following two-dimensional table:</p>
<p><img height="255" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308003.gif" width="372" border="0" /></p>
<p>(Figure 2)</p>
<p>This table corresponds to a ColumnFirstDynamicArray with 10 elements, one per edge; the values of each element are listed below:</p>
<div class="code">
<div class="title">
<div style="clear: none">The ColumnFirstDynamicArray for this example</div>
</div>
<div class="content">row:0,&nbsp; col:1,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: 始@A <br />
row:1,&nbsp; col:2,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: A@B <br />
row:1,&nbsp; col:3,&nbsp; eWeight:2,&nbsp; nPOS:0,&nbsp; sWord: A@C <br />
row:2,&nbsp; col:3,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: B@C <br />
row:2,&nbsp; col:4,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: B@D <br />
row:3,&nbsp; col:4,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: C@D <br />
row:4,&nbsp; col:5,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: D@E <br />
row:3,&nbsp; col:6,&nbsp; eWeight:2,&nbsp; nPOS:0,&nbsp; sWord: C@末 <br />
row:4,&nbsp; col:6,&nbsp; eWeight:3,&nbsp; nPOS:0,&nbsp; sWord: D@末 <br />
row:5,&nbsp; col:6,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: E@末</div>
</div>
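<h3>2. Computing the shortest-path PreNodes of every node</h3>
<p>Before solving the N-shortest path, let us see how the PreNodes of the shortest path are found, as shown below:</p>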
<p><img height="195" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308004.gif" width="456" border="0" /></p>
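<p>(Figure 3)</p>
<p>First compute the shortest path to every node and push the parent of that node onto the queue that belongs to the node. Take node 3 (&#8220;C&#8221;): the shortest path that reaches it has length 3, and its parent can be either node 1 (&#8220;A&#8221;) or node 2 (&#8220;B&#8221;), so its queue stores two PreNodes.</p>
<p>How do we know, during the actual computation, how many shortest paths reach node 3 (&#8220;C&#8221;)? We first compute the lengths of all paths that reach node 3 and arrange them in ascending order (the CQueue class takes care of all this), then read the queue from the front and take every PreNode whose length equals the shortest one.</p>
<p>The (simplified) code that computes the possible edges into the current node (nCurNode) and enqueues them in ascending order of total path length is:</p>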
<div class="code">
<div class="title">
<div style="clear: none">The EnQueueCurNodeEdges method</div>
</div>
<div class="content"><span style="color: #008000">//==================================================================== </span><br />
<span style="color: #008000">// Enqueue all possible edges into the current node (nCurNode), sorted by eWeight </span><br />
<span style="color: #008000">//==================================================================== </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> EnQueueCurNodeEdges(<span style="color: #0000ff">ref</span> CQueue queWork, <span style="color: #0000ff">int</span> nCurNode) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> nPreNode; <br />
&nbsp;&nbsp; <span style="color: #0000ff">double</span> eWeight; <br />
&nbsp;&nbsp; ChainItem&lt;ChainContent&gt; pEdgeList; <br />
<br />
&nbsp;&nbsp; queWork.Clear(); <br />
&nbsp;&nbsp; pEdgeList = m_apCost.GetFirstElementOfCol(nCurNode); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">// Get all edges into the current node </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pEdgeList != <span style="color: #0000ff">null</span> &amp;&amp; pEdgeList.col == nCurNode) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPreNode = pEdgeList.row;&nbsp; <span style="color: #008000">// </span><font color="#ff0000">A peculiar statement: it exploits the relation between row and col</font><span style="color: #008000"> </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; eWeight = pEdgeList.Content.eWeight; <br />
<br />
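&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nPreNode == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// The edge starts at the first node, which has no PreNode: enqueue the edge weight as is </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queWork.EnQueue(<span style="color: #0000ff">new</span> QueueElement(nPreNode, eWeight)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// Otherwise add the edge weight to the shortest length already known for the PreNode </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queWork.EnQueue(<span style="color: #0000ff">new</span> QueueElement(nPreNode, eWeight + m_pWeight[nPreNode - 1])); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// Always advance, so that the remaining edges into nCurNode are visited too </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pEdgeList = pEdgeList.next; <br />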
&nbsp;&nbsp; } <br />
} <br />
</div>
</div>
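<p>One statement in this code is rather peculiar: the one annotated in red, &#8220;nPreNode = pEdgeList.row;&#8221;. It took me a long time to work out what the original ICTCLAS meant by it. To understand it, refer to Figure 2 of this article, repeated here for convenience:</p>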
<p><img height="255" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308003.gif" width="372" border="0" /></p>
<p>Note that node <strong><font color="#0000ff">3</font></strong> (&#8220;C&#8221;) sits in column <font color="#0000ff"><strong>3</strong></font> of the table, and every <strong><font color="#0000ff">edge</font></strong> that reaches this node is an element of that column (here the two elements &#8220;A@C&#8221; and &#8220;B@C&#8221;). The PreNodes that form these two edges with node <strong><font color="#0000ff">3</font></strong> (&#8220;C&#8221;) are exactly the &#8220;<strong><font color="#ff0000">row numbers</font></strong>&#8221; of the two elements: node <strong><font color="#ff0000">1</font></strong> (&#8220;A&#8221;) and node <strong><font color="#ff0000">2</font></strong> (&#8220;B&#8221;). This correspondence gives us a convenient way to retrieve all edges that reach a node. Keep it in mind when reading the code above.</p>
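<p>The row/column relation can be tried out with a small sketch. It only imitates the idea behind ColumnFirstDynamicArray with a sorted List; the ColumnFirstSketch class and its methods are hypothetical and are not part of SharpICTCLAS. Edges are sorted by column first and row second, so all edges into a node are adjacent, and the row of each of them is a PreNode.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: reading PreNodes from a column-first edge list (hypothetical names)</div>
</div>
<div class="content">using System; <br />
using System.Collections.Generic; <br />
<br />
static class ColumnFirstSketch <br />
{ <br />
&nbsp;&nbsp; // Edges { row, col, weight } of the example; row is the source node, col the target node. <br />
&nbsp;&nbsp; public static List&lt;int[]&gt; ExampleEdges() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return new List&lt;int[]&gt; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 4, 6, 3 }, new[] { 0, 1, 1 }, new[] { 2, 3, 1 }, new[] { 1, 2, 1 }, new[] { 5, 6, 1 }, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 1, 3, 2 }, new[] { 3, 4, 1 }, new[] { 2, 4, 1 }, new[] { 3, 6, 2 }, new[] { 4, 5, 1 } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; // Returns the PreNodes of curNode. The list is sorted in place in column-first order. <br />
&nbsp;&nbsp; public static List&lt;int&gt; PreNodesOf(List&lt;int[]&gt; edges, int curNode) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; edges.Sort((a, b) =&gt; a[1] != b[1] ? a[1].CompareTo(b[1]) : a[0].CompareTo(b[0])); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int i = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; while (i &lt; edges.Count &amp;&amp; edges[i][1] &lt; curNode) i++; // skip to the first element of the column <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; List&lt;int&gt; pre = new List&lt;int&gt;(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for (; i &lt; edges.Count &amp;&amp; edges[i][1] == curNode; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pre.Add(edges[i][0]); // the row number of the element is the PreNode <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return pre; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; List&lt;int[]&gt; edges = ExampleEdges(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(string.Join(", ", PreNodesOf(edges, 3))); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(string.Join(", ", PreNodesOf(edges, 6))); <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
<p>For node 3 the sketch prints &#8220;1, 2&#8221; and for node 6 it prints &#8220;3, 4, 5&#8221;. The loop that stops as soon as the column changes mirrors the condition &#8220;pEdgeList.col == nCurNode&#8221; in EnQueueCurNodeEdges.</p>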
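<h3>3. Extracting the shortest paths</h3>
<p>Once the shortest-path PreNodes of every node are known, the complete shortest paths have to be derived from them. The original ICTCLAS does this in its GetPaths method, but to this day I have not understood what that code is trying to do; I only know that it uses several while loops, several if statements and several levels of nesting... (The ICTCLAS GetPaths is reproduced below; if anyone understands it, please explain it to me. I suspect it is close to my own algorithm.)</p>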
<div class="code">
<div class="title">
<div style="clear: none">The GetPaths method in NShortPath.cpp</div>
</div>
<div class="content"><span style="color: #0000ff">void</span> CNShortPath::GetPaths(unsigned <span style="color: #0000ff">int</span> nNode, unsigned <span style="color: #0000ff">int</span> nIndex, <span style="color: #0000ff">int</span> <br />
&nbsp; **nResult, <span style="color: #0000ff">bool</span> bBest) <br />
{ <br />
&nbsp; CQueue queResult; <br />
&nbsp; unsigned <span style="color: #0000ff">int</span> nCurNode, nCurIndex, nParentNode, nParentIndex, nResultIndex = 0; <br />
<br />
&nbsp; <span style="color: #0000ff">if</span> (m_nResultCount &gt;= MAX_SEGMENT_NUM) <br />
&nbsp; <span style="color: #008000">//Only need 10 result </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> ; <br />
&nbsp; nResult[m_nResultCount][nResultIndex] =&nbsp; - 1; <span style="color: #008000">//Init the result&nbsp; </span><br />
&nbsp; queResult.Push(nNode, nIndex); <br />
&nbsp; nCurNode = nNode; <br />
&nbsp; nCurIndex = nIndex; <br />
&nbsp; <span style="color: #0000ff">bool</span> bFirstGet; <br />
&nbsp; <span style="color: #0000ff">while</span> (!queResult.IsEmpty()) <br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (nCurNode &gt; 0) <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">// </span><br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Get its parent and store them in nParentNode,nParentIndex </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (m_pParent[nCurNode - 1][nCurIndex].Pop(&amp;nParentNode, &amp;nParentIndex, 0, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">false</span>, <span style="color: #0000ff">true</span>) !=&nbsp; - 1) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCurNode = nParentNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCurIndex = nParentIndex; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nCurNode &gt; 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queResult.Push(nCurNode, nCurIndex); <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nCurNode == 0) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Get a path and output </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResult[m_nResultCount][nResultIndex++] = nCurNode; <span style="color: #008000">//Get the first node </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bFirstGet = <span style="color: #0000ff">true</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nParentNode = nCurNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (queResult.Pop(&amp;nCurNode, &amp;nCurIndex, 0, <span style="color: #0000ff">false</span>, bFirstGet) !=&nbsp; - 1) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResult[m_nResultCount][nResultIndex++] = nCurNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bFirstGet = <span style="color: #0000ff">false</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nParentNode = nCurNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResult[m_nResultCount][nResultIndex] =&nbsp; - 1; <span style="color: #008000">//Set the end </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_nResultCount += 1; <span style="color: #008000">//The number of result add by 1 </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (m_nResultCount &gt;= MAX_SEGMENT_NUM) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Only need 10 result </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> ; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResultIndex = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResult[m_nResultCount][nResultIndex] =&nbsp; - 1; <span style="color: #008000">//Init the result&nbsp; </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (bBest) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Return the best result, ignore others </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> ; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; queResult.Pop(&amp;nCurNode, &amp;nCurIndex, 0, <span style="color: #0000ff">false</span>, <span style="color: #0000ff">true</span>); <span style="color: #008000">//Read the top node </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (queResult.IsEmpty() == <span style="color: #0000ff">false</span> &amp;&amp; (m_pParent[nCurNode - <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1][nCurIndex].IsSingle() || m_pParent[nCurNode - 1][nCurIndex].IsEmpty <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (<span style="color: #0000ff">true</span>))) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queResult.Pop(&amp;nCurNode, &amp;nCurIndex, 0); <span style="color: #008000">//Get rid of it </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queResult.Pop(&amp;nCurNode, &amp;nCurIndex, 0, <span style="color: #0000ff">false</span>, <span style="color: #0000ff">true</span>); <span style="color: #008000">//Read the top node </span><br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (queResult.IsEmpty() == <span style="color: #0000ff">false</span> &amp;&amp; m_pParent[nCurNode - <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1][nCurIndex].IsEmpty(<span style="color: #0000ff">true</span>) == <span style="color: #0000ff">false</span>) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pParent[nCurNode - 1][nCurIndex].Pop(&amp;nParentNode, &amp;nParentIndex, 0, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">false</span>, <span style="color: #0000ff">false</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCurNode = nParentNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCurIndex = nParentIndex; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nCurNode &gt; 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queResult.Push(nCurNode, nCurIndex); <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp; } <br />
}</div>
</div>
<p>I rewrote the method that extracts the shortest paths. The algorithm is as follows:</p>
<p><img height="199" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308005.gif" width="516" border="0" /></p>
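<p>(Figure 4)</p>
<p>1) First push the last element onto the stack (node 6 in this example). The whole task ends when this element is popped off the stack.</p>
<p>2) A current pointer is maintained for the PreNode queue of every node; initially each pointer points to the first element of its queue.</p>
<p>3) Working from right to left, take the current element of each PreNode queue in turn and push it onto the stack, resetting the queue pointer to the first element of the queue. As in Figure 4: the PreNode of element 6 is 3, the PreNode of element 3 is 1, and the PreNode of element 1 is 0.</p>
<p>4) Once the first element (node 0) has been pushed, the content of the stack is one path. In this example, 0, 1, 3, 6 is a shortest path.</p>
<p>5) Pop the elements off the stack one by one. Each time an element is popped, move the pointer of the PreNode queue it was taken from down by one position. If the pointer is already at the end and cannot move, repeat step 5; if it can still move, go back to step 3.</p>
<p>In this example, &#8220;0&#8221; is popped first. It came from the PreNode queue of node 1 (&#8220;A&#8221;), whose pointer cannot move any further, so &#8220;1&#8221; is popped next. That element came from the queue of node 3 (&#8220;C&#8221;), so the pointer of the PreNode queue of node 3 is moved down. It can move, so the 2 in that queue is pushed onto the stack; the PreNode of node 2 (&#8220;B&#8221;) is 1, so 1 is pushed, and so on until 0 is pushed. This yields another shortest path: 0, 1, 2, 3, 6, as shown below:</p>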
<p><img height="196" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308006.gif" width="512" border="0" /></p>
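<p>(Figure 5)</p>
<p>Next, 0, 1 and 2 are popped. When 3 is popped, the pointer of the PreNode queue of element 6, from which it came, can still move down, so 5 is pushed, followed by its chain of PreNodes until 0 is on the stack. The third of the shortest paths is then output: 0, 1, 2, 4, 5, 6, as shown below:</p>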
<p><img height="195" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308007.gif" width="512" border="0" /></p>
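<p>(Figure 6)</p>
<p>After this output the popping resumes. No element left on the stack has a PreNode queue whose pointer can still move, so the last element, 6, is popped as well and the output is complete. We obtain three shortest paths:</p>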
<ul>
    <li><font color="#0000ff">0, 1, 3, 6</font> </li>
    <li><font color="#0000ff">0, 1, 2, 3, 6</font> </li>
    <li><font color="#0000ff">0, 1, 2, 4, 5, 6</font> </li>
</ul>
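<p>The walk-through above can be checked with a compact sketch. It is not the SharpICTCLAS method: the PathSketch class is hypothetical, the PreNode lists are hard-coded from the edge weights of the example rather than computed, and recursion replaces the per-queue pointers. The call stack does the backtracking, while an explicit Stack&lt;int&gt; holds the nodes of the path being built.</p>
<div class="code">
<div class="title">
<div style="clear: none">Sketch: enumerating all shortest paths from the PreNode lists (hypothetical names)</div>
</div>
<div class="content">using System; <br />
using System.Collections.Generic; <br />
<br />
static class PathSketch <br />
{ <br />
&nbsp;&nbsp; // Pre[n] lists the shortest-path PreNodes of node n in the example graph. <br />
&nbsp;&nbsp; static readonly int[][] Pre = <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new int[0],&nbsp;&nbsp;&nbsp; // node 0: the start, no PreNode <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 0 },&nbsp;&nbsp; // node 1 (A) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 1 },&nbsp;&nbsp; // node 2 (B) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 1, 2 }, // node 3 (C): length 3 via A or via B <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 2 },&nbsp;&nbsp; // node 4 (D) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 4 },&nbsp;&nbsp; // node 5 (E) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new[] { 3, 5 }&nbsp; // node 6: the end, length 5 via C or via E <br />
&nbsp;&nbsp; }; <br />
<br />
&nbsp;&nbsp; // Returns every path from node 0 to lastNode that follows the PreNode lists. <br />
&nbsp;&nbsp; public static List&lt;int[]&gt; GetPaths(int lastNode) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; List&lt;int[]&gt; result = new List&lt;int[]&gt;(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Walk(lastNode, new Stack&lt;int&gt;(), result); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return result; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; static void Walk(int node, Stack&lt;int&gt; stack, List&lt;int[]&gt; result) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; stack.Push(node); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (node == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; result.Add(stack.ToArray()); // ToArray lists the top first, i.e. 0 ... lastNode <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (int pre in Pre[node]) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Walk(pre, stack, result); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; stack.Pop(); // backtrack so the caller can try its next PreNode <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; static void Main() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; foreach (int[] path in GetPaths(6)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(string.Join(", ", path)); <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
<p>The sketch prints the same three paths in the same order: &#8220;0, 1, 3, 6&#8221;, &#8220;0, 1, 2, 3, 6&#8221; and &#8220;0, 1, 2, 4, 5, 6&#8221;. An iterative version with explicit queue pointers avoids deep recursion on long sentences.</p>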
<p>Here is how SharpICTCLAS implements this algorithm:</p>
<div class="code">
<div class="title">
<div style="clear: none">The GetPaths method in SharpICTCLAS</div>
</div>
<div class="content"><span style="color: #008000">//==================================================================== </span><br />
<span style="color: #008000">// Note: index = 0: the shortest paths; index = 1: the second-shortest paths, </span><br />
<span style="color: #008000">//&nbsp;&nbsp;&nbsp;&nbsp; and so on. index &lt;= this.m_nValueKind </span><br />
<span style="color: #008000">//==================================================================== </span><br />
<span style="color: #0000ff">public</span> List&lt;<span style="color: #0000ff">int</span>[]&gt; GetPaths(<span style="color: #0000ff">int</span> index) <br />
{ <br />
&nbsp;&nbsp; Stack&lt;PathNode&gt; stack = <span style="color: #0000ff">new</span> Stack&lt;PathNode&gt;(); <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> curNode = m_nNode - 1, curIndex = index; <br />
&nbsp;&nbsp; QueueElement element; <br />
&nbsp;&nbsp; PathNode node; <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span>[] aPath; <br />
&nbsp;&nbsp; List&lt;<span style="color: #0000ff">int</span>[]&gt; result = <span style="color: #0000ff">new</span> List&lt;<span style="color: #0000ff">int</span>[]&gt;(); <br />
<br />
&nbsp;&nbsp; element = m_pParent[curNode - 1][curIndex].GetFirst(); <br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (element != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// ---------- Build the path by pushing onto the stack ----------- </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; stack.Push(<span style="color: #0000ff">new</span> PathNode(curNode, curIndex)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; stack.Push(<span style="color: #0000ff">new</span> PathNode(element.nParent, element.nIndex)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curNode = element.nParent; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (curNode != 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; element = m_pParent[element.nParent - 1][element.nIndex].GetFirst(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; stack.Push(<span style="color: #0000ff">new</span> PathNode(element.nParent, element.nIndex)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curNode = element.nParent; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// -------------- Output the path -------------- </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; PathNode[] nArray = stack.ToArray();&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; aPath = <span style="color: #0000ff">new</span> <span style="color: #0000ff">int</span>[nArray.Length]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span>(<span style="color: #0000ff">int</span> i=0; i&lt;aPath.Length; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; aPath[i] = nArray[i].nParent; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; result.Add(aPath); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// -------------- Pop to check whether other paths remain -------------- </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">do</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; node = stack.Pop(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curNode = node.nParent; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curIndex = node.nIndex; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <span style="color: #0000ff">while</span> (curNode &lt; 1 || (stack.Count != 0 &amp;&amp; !m_pParent[curNode - 1][curIndex].CanGetNext)); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; element = m_pParent[curNode - 1][curIndex].GetNext(); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> result; <br />
}</div>
</div>
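<p>Note that the code above is the N-shortest path version, which is slightly more complex than the 1-shortest path version, although the overall structure is the same. It shortens the path-extraction code from the 70-odd lines of the original ICTCLAS to about 40 lines.</p>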
<ul>
    <li><font color="#800080"><strong>小结</strong></font> </li>
</ul>
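<p>1) Solving the N-shortest path is fairly complex, so this article starts with the 1-shortest path to show how SharpICTCLAS performs the computation; the next article generalizes it to the N-shortest path.</p>
<p>2) The 1-shortest path does not mean that there is only one shortest path; it means all paths that share the shortest length. In the example of this article, the 1-shortest path algorithm finds three paths, all of length 5, so all of them are shortest paths.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>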
 <img src ="http://www.blogjava.net/jiangyz/aggbug/171299.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 19:38 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171299.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (3): DynamicArray (repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171295.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 11:31:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171295.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171295.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171295.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171295.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171295.html</trackback:ping><description><![CDATA[<p>As the previous articles show, the DynamicArray class plays a crucial role in the rough segmentation stage of ICTCLAS. Its implementation in ICTCLAS is rather complex and tries to cover everything: the single GetElement method alone has to handle 1) a list sorted row-first; 2) a list sorted column-first; 3) the behaviour when nRow is -1; 4) the behaviour when nCol is -1; 5) the behaviour when neither nRow nor nCol is -1 (see my article 《天书般的ICTCLAS分词系统代码(一)》, roughly &#8220;The Cryptic Code of the ICTCLAS Segmentation System (1)&#8221;).</p>
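<p>To simplify the programming interface and untangle the intertwined logic, I redesigned DynamicArray so that three classes do the work of the original one. The changes are: 1) DynamicArray becomes an abstract base class that implements the shared functionality; 2) RowFirstDynamicArray and ColumnFirstDynamicArray derive from DynamicArray and implement the row-first and col-first orderings respectively; 3) the code is made much simpler and more readable, at the cost of a small amount of performance.</p>
<p>The implementation is as follows:</p>
<h3>1. Definition of the DynamicArray list node</h3>
<p>To make DynamicArray more general, the list node is defined with generics:</p>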
<div class="code">
<div class="title">
<div style="clear: none">Definition of the DynamicArray list node</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> ChainItem&lt;T&gt; <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> row; <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> col; <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> T Content; <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> ChainItem&lt;T&gt; next; <br />
}</div>
</div>
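<h3>2. The DynamicArray class</h3>
<p>DynamicArray is an abstract class that provides the functionality shared by RowFirstDynamicArray and ColumnFirstDynamicArray, such as finding the node at a given (nRow, nCol). The method that inserts a new node is declared abstract, so each concrete class supplies its own implementation according to whether it sorts row-first or col-first. The code is as follows (and is quite simple):</p>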
<div class="code">
<div class="title">
<div style="clear: none">DynamicArray.cs</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">abstract</span> <span style="color: #0000ff">class</span> DynamicArray&lt;T&gt; <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">protected</span> ChainItem&lt;T&gt; pHead;&nbsp; <span style="color: #008000">//The head pointer of array chain </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> ColumnCount, RowCount;&nbsp; <span style="color: #008000">//The row and col of the array </span><br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> DynamicArray() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pHead = <span style="color: #0000ff">null</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; RowCount = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ColumnCount = 0; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> ItemCount <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">get</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ChainItem&lt;T&gt; pCur = pHead; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">int</span> nCount = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCount++; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> nCount; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Find the node whose row and col are nRow and nCol </span><br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> ChainItem&lt;T&gt; GetElement(<span style="color: #0000ff">int</span> nRow, <span style="color: #0000ff">int</span> nCol) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ChainItem&lt;T&gt; pCur = pHead; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span> &amp;&amp; !(pCur.col == nCol &amp;&amp; pCur.row == nRow)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> pCur; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Set an existing node or insert a new one </span><br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">abstract</span> <span style="color: #0000ff">void</span> SetElement(<span style="color: #0000ff">int</span> nRow, <span style="color: #0000ff">int</span> nCol, T content); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Return the head element of ArrayChain </span><br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> ChainItem&lt;T&gt; GetHead() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> pHead; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #008000">//Get the tail Element buffer and return the count of elements </span><br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> GetTail(<span style="color: #0000ff">out</span> ChainItem&lt;T&gt; pTailRet) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ChainItem&lt;T&gt; pCur = pHead, pPrev = <span style="color: #0000ff">null</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">int</span> nCount = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCount++; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pPrev = pCur; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pTailRet = pPrev; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> nCount; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Set Empty </span><br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">void</span> SetEmpty() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pHead = <span style="color: #0000ff">null</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ColumnCount = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; RowCount = 0; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">override</span> <span style="color: #0000ff">string</span> ToString() <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; StringBuilder sb = <span style="color: #0000ff">new</span> StringBuilder(); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ChainItem&lt;T&gt; pCur = pHead; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append(<span style="color: #0000ff">string</span>.Format(<span style="color: #ff00ff">"row:{0,3},&nbsp; col:{1,3},&nbsp; "</span>, pCur.row, pCur.col)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append(pCur.Content.ToString()); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append(<span style="color: #ff00ff">"\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> sb.ToString(); <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
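<h3>3. Implementation of the RowFirstDynamicArray class</h3>
<p>RowFirstDynamicArray implements the row-first ordering of DynamicArray. It contains two methods: GetFirstElementOfRow (returns the first element whose row is nRow) and SetElement. GetFirstElementOfRow has two overloads.</p>
<p>In effect, the original ICTCLAS GetElement method is split into three methods (five if the overloads are counted): 1) GetElement, implemented in DynamicArray, returns the node at (nRow, nCol); 2) GetFirstElementOfRow is provided separately for row-first lists; 3) GetFirstElementOfColumn is provided separately for column-first lists.</p>
<p>RowFirstDynamicArray is implemented as follows:</p>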
<div class="code">
<div class="title">
<div style="clear: none">RowFirstDynamicArray.cs</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> RowFirstDynamicArray&lt;T&gt; : DynamicArray&lt;T&gt; <br />
{ <br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Find the first node whose row is nRow </span><br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> ChainItem&lt;T&gt; GetFirstElementOfRow(<span style="color: #0000ff">int</span> nRow) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ChainItem&lt;T&gt; pCur = pHead; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span> &amp;&amp; pCur.row != nRow) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> pCur; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Starting at startFrom, find the first node whose row is nRow </span><br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> ChainItem&lt;T&gt; GetFirstElementOfRow(<span style="color: #0000ff">int</span> nRow, ChainItem&lt;T&gt; startFrom) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ChainItem&lt;T&gt; pCur = startFrom; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span> &amp;&amp; pCur.row != nRow) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> pCur; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Set an existing node or insert a new one </span><br />
&nbsp;&nbsp; <span style="color: #008000">//==================================================================== </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">override</span> <span style="color: #0000ff">void</span> SetElement(<span style="color: #0000ff">int</span> nRow, <span style="color: #0000ff">int</span> nCol, T content) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ChainItem&lt;T&gt; pCur = pHead, pPre = <span style="color: #0000ff">null</span>, pNew;&nbsp; <span style="color: #008000">//The pointer of array chain </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nRow &gt; RowCount)<span style="color: #008000">//Set the array row </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; RowCount = nRow; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nCol &gt; ColumnCount)<span style="color: #008000">//Set the array col </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ColumnCount = nCol; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span> &amp;&amp; (pCur.row &lt; nRow || (pCur.row == nRow &amp;&amp; pCur.col &lt; nCol))) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pPre = pCur; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pCur != <span style="color: #0000ff">null</span> &amp;&amp; pCur.row == nRow &amp;&amp; pCur.col == nCol)<span style="color: #008000">//Find the same position </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur.Content = content;<span style="color: #008000">//Set the value </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pNew = <span style="color: #0000ff">new</span> ChainItem&lt;T&gt;();<span style="color: #008000">//malloc a new node </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pNew.col = nCol; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pNew.row = nRow; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pNew.Content = content; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pNew.next = pCur; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pPre == <span style="color: #0000ff">null</span>)<span style="color: #008000">//no predecessor: pNew becomes the new head </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pHead = pNew; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pPre.next = pNew; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
<p>ColumnFirstDynamicArray is implemented in much the same way, so its code is omitted here. For comparison, here is the GetElement implementation from the original ICTCLAS:</p>
<div class="code">
<div class="title">
<div style="clear: none">DynamicArray.cpp</div>
</div>
<div class="content">ELEMENT_TYPE CDynamicArray::GetElement(<span style="color: #0000ff">int</span> nRow, <span style="color: #0000ff">int</span> nCol, PARRAY_CHAIN pStart, <br />
&nbsp; PARRAY_CHAIN *pRet) <br />
{ <br />
&nbsp; PARRAY_CHAIN pCur = pStart; <br />
&nbsp; <span style="color: #0000ff">if</span> (pStart == 0) <br />
&nbsp;&nbsp;&nbsp; pCur = m_pHead; <br />
&nbsp; <span style="color: #0000ff">if</span> (pRet != 0) <br />
&nbsp;&nbsp;&nbsp; *pRet = NULL; <br />
&nbsp; <span style="color: #0000ff">if</span> (nRow &gt; (<span style="color: #0000ff">int</span>)m_nRow || nCol &gt; (<span style="color: #0000ff">int</span>)m_nCol) <br />
&nbsp; <span style="color: #008000">//Judge if the row and col is overflow </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> INFINITE_VALUE; <br />
&nbsp; <span style="color: #0000ff">if</span> (<strong>m_bRowFirst</strong>) <br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (<strong>pCur != NULL &amp;&amp; (nRow !=&nbsp; - 1 &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;row &lt; nRow || (nCol !=&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - 1 &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;row == nRow &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;col &lt; nCol))</strong>) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur-&gt;next; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp; } <br />
&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (<strong>pCur != NULL &amp;&amp; (nCol !=&nbsp; - 1 &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;col &lt; nCol || ((<span style="color: #0000ff">int</span>)pCur <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -&gt;col == nCol &amp;&amp; nRow !=&nbsp; - 1 &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;row &lt; nRow))</strong>) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur-&gt;next; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp; } <br />
&nbsp; <span style="color: #0000ff">if</span> (<strong>pCur != NULL &amp;&amp; ((<span style="color: #0000ff">int</span>)pCur-&gt;row == nRow || nRow ==&nbsp; - 1) &amp;&amp; ((<span style="color: #0000ff">int</span>)pCur <br />
&nbsp;&nbsp;&nbsp; -&gt;col == nCol || nCol ==&nbsp; - 1)</strong>) <br />
&nbsp; <span style="color: #008000">//Find the same position </span><br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Find it and return the value </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> pCur-&gt;<span style="color: #0000ff">value</span>; <br />
&nbsp; } <br />
&nbsp; <span style="color: #0000ff">return</span> INFINITE_VALUE; <br />
}</div>
</div>
<p>As you can see, splitting the original GetElement into three methods makes the code much simpler and the logic clearer.</p>
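<p>The article does not list ColumnFirstDynamicArray. The following is only a guess at what it might look like, obtained by swapping the roles of row and col in RowFirstDynamicArray. The overloads of <code>GetFirstElementOfColumn</code> are assumed to mirror those of <code>GetFirstElementOfRow</code>, and the base class is trimmed to the members this sketch needs; the real SharpICTCLAS source may differ.</p>

```csharp
using System;

public class ChainItem<T>
{
    public int row;
    public int col;
    public T Content;
    public ChainItem<T> next;
}

// Trimmed-down stand-in for the abstract base class shown earlier.
public abstract class DynamicArray<T>
{
    protected ChainItem<T> pHead;
    public int ColumnCount, RowCount;

    public ChainItem<T> GetHead() { return pHead; }
    public abstract void SetElement(int nRow, int nCol, T content);
}

// Hypothetical sketch: RowFirstDynamicArray with row and col swapped.
public class ColumnFirstDynamicArray<T> : DynamicArray<T>
{
    // First node whose col equals nCol, searching from the head.
    public ChainItem<T> GetFirstElementOfColumn(int nCol)
    {
        return GetFirstElementOfColumn(nCol, pHead);
    }

    // First node whose col equals nCol, searching from startFrom.
    public ChainItem<T> GetFirstElementOfColumn(int nCol, ChainItem<T> startFrom)
    {
        ChainItem<T> pCur = startFrom;
        while (pCur != null && pCur.col != nCol)
            pCur = pCur.next;
        return pCur;
    }

    public override void SetElement(int nRow, int nCol, T content)
    {
        ChainItem<T> pCur = pHead, pPre = null;

        if (nRow > RowCount) RowCount = nRow;
        if (nCol > ColumnCount) ColumnCount = nCol;

        // Skip nodes that sort before (nCol, nRow): col is compared first, then row.
        while (pCur != null && (pCur.col < nCol || (pCur.col == nCol && pCur.row < nRow)))
        {
            pPre = pCur;
            pCur = pCur.next;
        }

        if (pCur != null && pCur.row == nRow && pCur.col == nCol)
        {
            pCur.Content = content;      // position already present: overwrite
            return;
        }

        ChainItem<T> pNew = new ChainItem<T>();
        pNew.row = nRow;
        pNew.col = nCol;
        pNew.Content = content;
        pNew.next = pCur;

        if (pPre == null) pHead = pNew;  // no predecessor: new head
        else pPre.next = pNew;           // otherwise link after the predecessor
    }
}

public static class Demo
{
    public static void Main()
    {
        ColumnFirstDynamicArray<string> arr = new ColumnFirstDynamicArray<string>();
        arr.SetElement(3, 5, "的确");
        arr.SetElement(4, 5, "确");
        arr.SetElement(3, 4, "的");

        for (ChainItem<string> p = arr.GetHead(); p != null; p = p.next)
            Console.WriteLine("row:{0}, col:{1}, {2}", p.row, p.col, p.Content);
    }
}
```

<p>Whatever the insertion order, the demo lists the nodes as (3,4), (3,5), (4,5): sorted by col first, then by row within the same col.</p>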
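<h3>4. Trading performance for readability</h3>
<p>To keep the code clear and readable, DynamicArray makes a few trade-offs. Let us compare how SharpICTCLAS and ICTCLAS differ in this respect. The code below contrasts GetFirstElementOfRow in the two systems (I have deliberately simplified the logic of the ICTCLAS code):</p>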
<div class="code">
<div class="title">
<div style="clear: none">Code</div>
</div>
<div class="content"><span style="color: #008000">//==================================================================== </span><br />
<span style="color: #008000">// SharpICTCLAS: find the first node whose row is nRow </span><br />
<span style="color: #008000">//==================================================================== </span><br />
<span style="color: #0000ff">public</span> ChainItem&lt;T&gt; GetFirstElementOfRow(<span style="color: #0000ff">int</span> nRow) <br />
{ <br />
&nbsp;&nbsp; ChainItem&lt;T&gt; pCur = pHead; <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span> &amp;&amp; pCur.row != nRow) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> pCur; <br />
} <br />
<br />
<span style="color: #008000">//==================================================================== </span><br />
<span style="color: #008000">// ICTCLAS: find the first node whose row is nRow </span><br />
<span style="color: #008000">//==================================================================== </span><br />
... GetElement(<span style="color: #0000ff">int</span> nRow, <span style="color: #0000ff">int</span> nCol, PARRAY_CHAIN pStart, PARRAY_CHAIN *pRet)&nbsp; <br />
{&nbsp; <br />
&nbsp; PARRAY_CHAIN pCur = pStart;&nbsp; <br />
<br />
&nbsp; <span style="color: #0000ff">while</span> (pCur != NULL &amp;&amp; (pCur-&gt;row &lt; nRow || (pCur-&gt;row == nRow &amp;&amp; pCur-&gt;col &lt; nCol)))&nbsp; <br />
&nbsp; {&nbsp; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0)&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur;&nbsp; <br />
&nbsp;&nbsp;&nbsp; pCur = pCur-&gt;next;&nbsp; <br />
&nbsp; }&nbsp; <br />
<br />
&nbsp; <span style="color: #0000ff">if</span> (pCur != NULL &amp;&amp; pCur-&gt;row == nRow &amp;&amp; pCur-&gt;col == nCol)&nbsp; <br />
&nbsp; {&nbsp; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0)&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur;&nbsp; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> pCur-&gt;<span style="color: #0000ff">value</span>;&nbsp; <br />
&nbsp; }&nbsp; <br />
&nbsp; <span style="color: #008000">//...... </span><br />
}</div>
</div>
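<p>As the code shows, the original ICTCLAS takes full advantage of the fact that DynamicArray is a sorted list: it keeps searching only while pCur-&gt;row &lt; nRow or pCur-&gt;col &lt; nCol still holds, and acts only if it then finds &#8220;pCur-&gt;row == nRow &amp;&amp; pCur-&gt;col == nCol&#8221;.</p>
<p>In SharpICTCLAS the loop condition is just &#8220;pCur != null &amp;&amp; pCur.row != nRow&#8221;. If the nRow you are looking for is not in the list, the search becomes a &#8220;full traversal&#8221;, which seems like far too wide a search.</p>
<p>Nevertheless, I chose this form for the following reasons:</p>
<p>1) A Chinese sentence is never very long, so the list is short and even a &#8220;full traversal&#8221; costs little time.</p>
<p>2) The nRow being searched for is rarely absent from the list, so a &#8220;full traversal&#8221; seldom happens.</p>
<p>3) The original ICTCLAS works hard to narrow the search range, but to make sure pRet is assigned it evaluates &#8220;if (pRet != 0)&#8221; on every loop iteration, which from a performance standpoint costs more than it saves.</p>
<p>4) To narrow the search range, the original ICTCLAS also evaluates a longer condition on every iteration, &#8220;pCur != NULL &amp;&amp; (pCur-&gt;row &lt; nRow || (pCur-&gt;row == nRow &amp;&amp; pCur-&gt;col &lt; nCol))&#8221;, and the cost of those extra comparisons has to be taken into account.</p>
<p>For these reasons, SharpICTCLAS replaces the original GetElement with this simpler and not necessarily slower approach.</p>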
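<p>The difference between the two loop conditions can be made concrete with a small sketch. The <code>Node</code> and <code>SearchCost</code> names and the five-node chain are made up for illustration, and counting loop steps is not a benchmark; it only shows how far each condition walks when the requested row is missing.</p>

```csharp
using System;

// Hypothetical minimal node for this sketch (no payload).
public class Node
{
    public int row, col;
    public Node next;
    public Node(int row, int col, Node next) { this.row = row; this.col = col; this.next = next; }
}

public static class SearchCost
{
    // SharpICTCLAS-style test: stop only at a matching row or at the end of the chain.
    public static Node FullScan(Node head, int nRow, out int steps)
    {
        steps = 0;
        Node p = head;
        while (p != null && p.row != nRow) { p = p.next; steps++; }
        return p;
    }

    // Order-aware test: the chain is sorted by row, so stop as soon as p.row >= nRow.
    public static Node EarlyExit(Node head, int nRow, out int steps)
    {
        steps = 0;
        Node p = head;
        while (p != null && p.row < nRow) { p = p.next; steps++; }
        return (p != null && p.row == nRow) ? p : null;
    }

    public static void Main()
    {
        // Rows 0, 1, 2, 4, 5: row 3 is deliberately missing.
        Node head = new Node(0, 1, new Node(1, 2, new Node(2, 3, new Node(4, 5, new Node(5, 6, null)))));

        int full, early;
        Node a = FullScan(head, 3, out full);
        Node b = EarlyExit(head, 3, out early);
        Console.WriteLine("full scan: {0} steps, found: {1}", full, a != null);
        Console.WriteLine("early exit: {0} steps, found: {1}", early, b != null);
    }
}
```

<p>With row 3 missing, the full scan takes 5 steps and the order-aware version 3, and neither finds a node. On chains as short as one sentence the gap stays small, which is the trade-off the reasons above describe.</p>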
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
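<p>SharpICTCLAS redesigns the DynamicArray class in order to simplify the complicated logic of the original design, and the improvement is quite visible. Even if there is some loss of performance, it is negligible; weighing the pros and cons, I chose the path of simpler code.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>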
<img src ="http://www.blogjava.net/jiangyz/aggbug/171295.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 19:31 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171295.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (2): Preliminary Segmentation (repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171294.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 11:29:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171294.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171294.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171294.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171294.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171294.html</trackback:ping><description><![CDATA[<p>Preliminary segmentation in ICTCLAS consists of three steps: 1) atom segmentation; 2) finding every possible way of combining the atoms into words; 3) rough Chinese word segmentation by N-shortest paths.</p>
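<p>Take the sentence &#8220;他说的确实在理&#8221; (&#8220;What he said is indeed reasonable&#8221;) as an example.</p>
<p>1) Atom segmentation splits the sentence into individual Chinese characters. After this step the sentence becomes &#8220;<font color="#008000">始##始/他/说/的/确/实/在/理/末##末</font>&#8221;.</p>
<p>2) Next, the &#8220;word dictionary&#8221; is used to find every possible word that the atoms can form. After the dictionary lookup the sentence becomes &#8220;<font color="#008000">始##始/他/说</font>/<font color="#0000ff">的</font>/<font color="#0000ff">的确</font>/<font color="#ff0000">确</font>/<font color="#ff0000">确实</font>/<font color="#008080">实</font>/<font color="#008080">实在</font>/<font color="#ff00ff">在</font>/<font color="#ff00ff">在理</font><font color="#008000">/末##末</font>&#8221;.</p>
<p>3) Rough segmentation by N-shortest paths. From here on the process gets more complicated. First we need the distance between every possible pair of adjacent words (this requires looking up the BigramDict.dct dictionary). For the example above, the resulting BiGraph is:</p>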
<div class="code">
<div class="title">
<div style="clear: none">Distances between all pairs of adjacent candidate words</div>
</div>
<div class="content">row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">3.37</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:<font color="#008000">始##始@他 </font><br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">2.25</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">他@说</font> <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">4.05</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">说@的</font> <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">7.11</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">说@的确</font> <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">4.10</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">的@确</font> <br />
<strong>row:&nbsp; 3,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">4.10</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">的@确实</font> <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">11.49</font>,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:<font color="#008000">的确@实</font> </strong><br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">11.63</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">确@实</font> <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">11.49</font>,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:<font color="#008000">的确@实在</font> <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">11.63</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">确@实在</font> <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">3.92</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">确实@在</font> <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">10.98</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">实@在 </font><br />
row:&nbsp; 6,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">10.97</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">确实@在理</font> <br />
row:&nbsp; 7,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">10.98</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">实@在理</font> <br />
row:&nbsp; 8,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">11.17</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">实在@理</font> <br />
row:&nbsp; 9,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">5.62</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">在@理 </font><br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">14.30</font>,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:<font color="#008000">在理@末##末</font> <br />
row: 11,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff00ff">11.95</font>,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:<font color="#008000">理@末##末</font></div>
</div>
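<p>The table shows that the distance for &#8220;的@确实&#8221; is 4.10, while the distance for &#8220;的确@实&#8221; is 11.49, so &#8220;的@确实&#8221; is the more likely combination. That is only local evidence, though: the final choice of words has to be made in the context of the whole sentence so that the result is optimal overall. This is the job of the N-shortest-path step.</p>
<p>Before computing the shortest paths we need to look at the table from another angle: it can be expressed equivalently as the &#8220;directed graph&#8221; shown below:</p>
<p><img height="154" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308001.gif" width="572" border="0" /></p>
<p>Each word pair in the table (for example &#8220;的@确实&#8221;) corresponds to one edge in the graph (from node &#8220;的&#8221; to node &#8220;确实&#8221;), and the length of that edge is the eWeight value from the table. Finding the best segmentation thus becomes finding the overall shortest path.</p>
<p>Because this is only the first pass, and to give the later optimization stages better material to work with, what is computed here is the N-shortest paths: all paths whose length is among the N smallest distinct path lengths. For the example above, with N=2, the solution consists of two paths (there can also be more than two; I will come back to this point when discussing NShortPath in a later article):</p>
<p>Path (1):</p>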
<p>0, 1, 2, 3, 6, 9, 11, 12, 13</p>
<p><font color="#008000">始##始 / 他 / 说 / 的 / 确实 / 在 / 理 / 末##末</font></p>
<p>Path (2):</p>
<p>0, 1, 2, 3, 6, 10, 12, 13</p>
<p><font color="#008000">始##始 / 他 / 说 / 的 / 确实 / 在理 / 末##末</font></p>
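<p>The search over the BiGraph can be sketched as follows. This is not the NShortPath class from SharpICTCLAS: it is a plain &#8220;k best paths on a DAG&#8221; dynamic program that keeps the k shortest partial paths per vertex, and it does not group paths of equal length the way the N-shortest definition above does. The <code>Edge</code>, <code>PathCandidate</code> and <code>KBestPaths</code> names are assumptions; the edge list is copied from the row, col and eWeight columns of the BiGraph table.</p>

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for this sketch only.
public class Edge
{
    public int From, To;
    public double Weight;
    public Edge(int from, int to, double weight) { From = from; To = to; Weight = weight; }
}

public class PathCandidate
{
    public double Length;
    public List<int> Vertices;
    public PathCandidate(double length, List<int> vertices) { Length = length; Vertices = vertices; }
}

public static class KBestPaths
{
    // Returns up to k shortest paths from vertex 0 to the last vertex.
    // Assumes every edge goes from a smaller to a larger vertex number.
    public static List<PathCandidate> Find(IList<Edge> edges, int vertexCount, int k)
    {
        List<PathCandidate>[] best = new List<PathCandidate>[vertexCount];
        for (int v = 0; v < vertexCount; v++) best[v] = new List<PathCandidate>();
        best[0].Add(new PathCandidate(0.0, new List<int> { 0 }));

        for (int v = 0; v < vertexCount; v++)
        {
            // Every edge into v has already been processed, so keep only the k shortest.
            best[v].Sort((a, b) => a.Length.CompareTo(b.Length));
            if (best[v].Count > k) best[v].RemoveRange(k, best[v].Count - k);

            foreach (Edge e in edges)
            {
                if (e.From != v) continue;
                foreach (PathCandidate c in best[v])
                {
                    List<int> extended = new List<int>(c.Vertices);
                    extended.Add(e.To);
                    best[e.To].Add(new PathCandidate(c.Length + e.Weight, extended));
                }
            }
        }
        return best[vertexCount - 1];
    }

    // Edges (row, col, eWeight) taken from the BiGraph table above.
    public static List<Edge> SampleEdges()
    {
        return new List<Edge>
        {
            new Edge(0, 1, 3.37),   new Edge(1, 2, 2.25),   new Edge(2, 3, 4.05),
            new Edge(2, 4, 7.11),   new Edge(3, 5, 4.10),   new Edge(3, 6, 4.10),
            new Edge(4, 7, 11.49),  new Edge(5, 7, 11.63),  new Edge(4, 8, 11.49),
            new Edge(5, 8, 11.63),  new Edge(6, 9, 3.92),   new Edge(7, 9, 10.98),
            new Edge(6, 10, 10.97), new Edge(7, 10, 10.98), new Edge(8, 11, 11.17),
            new Edge(9, 11, 5.62),  new Edge(10, 12, 14.30), new Edge(11, 12, 11.95)
        };
    }

    public static void Main()
    {
        foreach (PathCandidate p in Find(SampleEdges(), 13, 2))
            Console.WriteLine("{0}  (length {1:F2})", string.Join(", ", p.Vertices), p.Length);
    }
}
```

<p>With k=2 the sketch yields the vertex sequences 0, 1, 2, 3, 6, 9, 11, 12 (length about 35.26) and 0, 1, 2, 3, 6, 10, 12 (length about 39.04), which correspond to the two segmentations listed above. The sketch numbers vertices by the row and col values of the table only, so its sequences stop at 12.</p>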
<hr align="left" width="60%" color="#990099" />
<p>To really understand the process above, you need a thorough grasp of the following data structure and its several different representations.</p>
<ul>
    <li><strong><font color="#800080">DynamicArray</font></strong> </li>
</ul>
<p>What is DynamicArray? In essence it is a sorted linked list. To make this clearer, consider the following figure:</p>
<p><img alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220001.gif" border="0" /></p>
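<p>(Figure 1)</p>
<p>The figure shows a linked list sorted by index; inserting a new node must keep the index values in order. DynamicArray does essentially the same thing, except that the sort key is not a single index but the pair row and col, as shown below:</p>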
<p><img alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220002.gif" border="0" /></p>
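<p>(Figure 2)</p>
<p>As you can see, this list is sorted by row first, and by col among nodes with the same row. The sort rule can of course be changed: if we sort by col first and then by row, the list above has to be arranged as:</p>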
<p><img alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220003.gif" border="0" /></p>
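<p>(Figure 3)</p>
<p>The original ICTCLAS handles both the row-first and the col-first case in a single class. SharpICTCLAS changes this substantially: DynamicArray serves as a base class providing the basic operations, while the RowFirstDynamicArray and ColumnFirstDynamicArray subclasses provide the specialized methods, which makes the code much clearer (I will describe the DynamicArray implementation in the next article).</p>
<p>In the example at the start of this article, looking up the &#8220;word dictionary&#8221; for every possible word among the atoms gave &#8220;<font color="#008000">始##始/他/说</font>/<font color="#0000ff">的</font>/<font color="#0000ff">的确</font>/<font color="#ff0000">确</font>/<font color="#ff0000">确实</font>/<font color="#008080">实</font>/<font color="#008080">实在</font>/<font color="#ff00ff">在</font>/<font color="#ff00ff">在理</font><font color="#008000">/末##末</font>&#8221;. That result is stored in a RowFirstDynamicArray, as follows:</p>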
<div class="code">
<div class="title">
<div style="clear: none">Output</div>
</div>
<div class="content">row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp; 19823.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:他 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 210.00,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 181.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 357.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 295.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实在 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末</div>
</div>
<ul>
    <li><strong><font color="#800080">The two-dimensional table representation of DynamicArray</font></strong> </li>
</ul>
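<p>Placing the entries of the RowFirstDynamicArray into a two-dimensional table according to their row and col coordinates gives the two-dimensional table representation of the DynamicArray, shown below:</p>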
<p><img height="229" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220005.gif" width="460" border="0" /></p>
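<p>(Figure 4)</p>
<p>In this table, rows and columns are related in an interesting way: <strong><font color="#0000ff">every word in column n can be combined with every word in row n</font></strong>. Take the word &#8220;的确&#8221;: its col is 5, so its smoothed value has to be computed against exactly two words, the two with row = 5: &#8220;实&#8221; and &#8220;实在&#8221;.</p>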
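<p>The column-n-to-row-n rule can be checked with a few lines of code. The sketch below is an illustration only: LatticeWord and LatticePairs are hypothetical names, and the data is copied from the row-first listing above with weights and POS values left out.</p>

```csharp
using System;

// Hypothetical type for illustration; mirrors only the row, col and sWord fields.
public class LatticeWord
{
    public int Row, Col;
    public string Word;
    public LatticeWord(int row, int col, string word) { Row = row; Col = col; Word = word; }
}

public static class LatticePairs
{
    // Entries copied from the row-first listing above (weights and POS omitted).
    public static LatticeWord[] Sample()
    {
        return new LatticeWord[]
        {
            new LatticeWord(0, 1, "始##始"), new LatticeWord(1, 2, "他"),
            new LatticeWord(2, 3, "说"),     new LatticeWord(3, 4, "的"),
            new LatticeWord(3, 5, "的确"),   new LatticeWord(4, 5, "确"),
            new LatticeWord(4, 6, "确实"),   new LatticeWord(5, 6, "实"),
            new LatticeWord(5, 7, "实在"),   new LatticeWord(6, 7, "在"),
            new LatticeWord(6, 8, "在理"),   new LatticeWord(7, 8, "理"),
            new LatticeWord(8, 9, "末##末")
        };
    }

    // Words that can directly follow `first`: those whose row equals first.Col.
    public static LatticeWord[] Successors(LatticeWord[] words, LatticeWord first)
    {
        return Array.FindAll(words, w => w.Row == first.Col);
    }
}
```

<p>Calling Successors for the entry &#8220;的确&#8221; (col = 5) returns the two entries with row = 5, &#8220;实&#8221; and &#8220;实在&#8221;, which matches the example in the text.</p>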
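<p>Once every such row/column relationship between words has been found, we obtain a second structure, a ColumnFirstDynamicArray, whose contents are shown in the first table of this article. Placing the contents of that ColumnFirstDynamicArray into another two-dimensional table gives the following:</p>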
<p><img alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220006.gif" border="0" /></p>
<p>(Figure 5)</p>
<ul>
    <li><strong><font color="#800080">The directed-graph representation of ColumnFirstDynamicArray</font></strong> </li>
</ul>
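<p>The table above corresponds to a directed graph. As mentioned earlier, each word pair (for example &#8220;的@确实&#8221;) corresponds to one edge in the graph (from node &#8220;的&#8221; to node &#8220;确实&#8221;), and the length of that edge is the eWeight value in the table above, as shown below:</p>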
<p><img height="154" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308001.gif" width="572" border="0" /></p>
<p>What remains is to solve the N-shortest-paths problem on this graph, which later articles will cover.</p>
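<p>To get a feel for the search space before any shortest-path computation, the sketch below simply enumerates every complete path through the example sentence. It is not the N-shortest-paths algorithm and it ignores weights. It also walks the row/col lattice directly (positions as nodes, words as edges) instead of the word-pair graph in the figure; the two views describe the same set of segmentations. LatticePaths is a hypothetical name.</p>

```csharp
using System;
using System.Text;

public static class LatticePaths
{
    // (row, col, word) triples copied from the row-first listing above.
    private static readonly int[] Rows = { 0, 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8 };
    private static readonly int[] Cols = { 1, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9 };
    private static readonly string[] Words =
    {
        "始##始", "他", "说", "的", "的确", "确", "确实",
        "实", "实在", "在", "在理", "理", "末##末"
    };

    // Returns every way to get from position 0 to position 9, one string per path.
    public static string[] AllSegmentations()
    {
        StringBuilder lines = new StringBuilder();
        Walk(0, 9, "", lines);
        return lines.ToString().Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Depth-first walk: extend the path with every word that starts at `position`.
    private static void Walk(int position, int end, string prefix, StringBuilder lines)
    {
        if (position == end)
        {
            lines.Append(prefix).Append('\n');
            return;
        }
        for (int i = 0; i < Words.Length; i++)
        {
            if (Rows[i] != position) continue;
            string extended = prefix.Length == 0 ? Words[i] : prefix + " / " + Words[i];
            Walk(Cols[i], end, extended, lines);
        }
    }
}
```

<p>For this sentence the walk finds 8 candidate paths, among them the two paths listed earlier that end in &#8220;在 / 理&#8221; and &#8220;在理&#8221;. Exhaustive enumeration grows quickly with sentence length, which is one reason a weighted N-shortest-paths search is used instead.</p>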
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
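<p>ICTCLAS's initial segmentation is a fairly complex process; its data structures, data representations, and algorithms are all quite demanding. SharpICTCLAS makes substantial changes to the original C++ code, redesigning the DynamicArray class and the NShortPath method, and trades a limited amount of performance for much simpler, more readable code.</p>
<p>The implementation of the DynamicArray class in SharpICTCLAS, and the trade-off between code readability and performance, will be covered in the next article.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>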
<img src ="http://www.blogjava.net/jiangyz/aggbug/171294.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 19:29 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171294.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Segmentation System (1): Reading the Dictionary (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171291.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 11:21:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171291.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171291.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171291.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171291.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171291.html</trackback:ping><description><![CDATA[<p>The overall ICTCLAS segmentation pipeline has five steps: 1) initial segmentation; 2) part-of-speech tagging; 3) person-name and place-name recognition; 4) re-segmentation; 5) re-tagging. The first step, initial segmentation, is itself divided into three sub-steps: 1) atom segmentation; 2) finding every possible word formed from adjacent atoms; 3) rough segmentation of Chinese words by N-shortest paths.</p>
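<p>Of all this functionality, reading the dictionary files is the most basic. ICTCLAS stores its dictionaries in the Data directory. The commonly used ones are coreDict.dct (the word dictionary), BigramDict.dct (word-to-word association data), nr.dct (person names), ns.dct (place names), and tr.dct (transliterated person names). They all share exactly the same file format and are all parsed by the CDictionary class. For an in-depth look at the ICTCLAS dictionary structure, see sinboy's article &#8220;<a href="http://blog.csdn.net/sinboy/archive/2006/03/15/624909.aspx">ICTCLAS分词系统研究（二）--词典结构</a>&#8221; (A Study of the ICTCLAS Segmentation System, Part 2: Dictionary Structure), which describes it in detail. Here I only give the SharpICTCLAS implementation.</p>
<p>First come the definitions of the basic elements. SharpICTCLAS renames some of the original identifiers so that they are more meaningful and follow C# conventions. The code is as follows:</p>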
<div class="code">
<div class="title">
<div style="clear: none">WordDictionaryElement.cs</div>
</div>
<div class="content"><span style="color: #0000ff">using</span> System; <br />
<span style="color: #0000ff">using</span> System.Collections.Generic; <br />
<span style="color: #0000ff">using</span> System.Text; <br />
<br />
<span style="color: #0000ff">namespace</span> SharpICTCLAS <br />
{ <br />
&nbsp;&nbsp; <span style="color: #008000">//================================================== </span><br />
&nbsp;&nbsp; <span style="color: #008000">// Original predefined in DynamicArray.h file </span><br />
&nbsp;&nbsp; <span style="color: #008000">//================================================== </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> ArrayChainItem <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> col, row;<span style="color: #008000">//row and column </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">double</span> <span style="color: #0000ff">value</span>;<span style="color: #008000">//The value of the array </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> nPOS; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> nWordLen; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">string</span> sWord; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The possible POS of the word related to the segmentation graph </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> ArrayChainItem next; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> WordResult <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The word&nbsp; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">string</span> sWord; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//the POS of the word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> nPOS; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The -log(frequency/MAX) </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">double</span> dValue; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">// data structure for word item </span><br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> WordItem <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> nWordLen; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The word&nbsp; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">string</span> sWord; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//the process or information handle of the word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> nPOS; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The count which it appear </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> nFrequency; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//data structure for dictionary index table item </span><br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> IndexTableItem <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The count number of words which initial letter is sInit </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> nCount; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The&nbsp; head of word items </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> WordItem[] WordItems; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//data structure for word item chain </span><br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> WordChain <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> WordItem data; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> WordChain next; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//data structure for dictionary index table item </span><br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> ModifyTableItem <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The count number of words which initial letter is sInit </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> nCount; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The number of deleted items in the index table </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> nDelete; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The head of word items </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">public</span> WordChain pWordItemHead; <br />
&nbsp;&nbsp; }&nbsp; <br />
} <br />
</div>
</div>
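<p>ModifyTableItem is the building block of the ModifyTable. During actual segmentation, however, the dictionary is usually read-only, so the ModifyTable, which exists to modify the dictionary, plays little role. I therefore omit the ModifyTable code in what follows.</p>
<p>With the basic elements defined, the next step is the dictionary class itself. In the original C++ code every class name starts with a capital &#8220;C&#8221;, and the dictionary class is called CDictionary. In SharpICTCLAS I dropped the leading &#8220;C&#8221; and, to avoid a clash with the framework's Dictionary class, named it WordDictionary. The class is responsible for reading, writing, and searching the dictionary. Here is how it reads a dictionary file:</p>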
<div class="code">
<div class="title">
<div style="clear: none">Reading the dictionary:</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> WordDictionary <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">bool</span> bReleased = <span style="color: #0000ff">true</span>; <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> IndexTableItem[] indexTable; <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> ModifyTableItem[] modifyTable; <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">bool</span> Load(<span style="color: #0000ff">string</span> sFilename) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> Load(sFilename, <span style="color: #0000ff">false</span>); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">bool</span> Load(<span style="color: #0000ff">string</span> sFilename, <span style="color: #0000ff">bool</span> bReset) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">int</span> frequency, wordLength, pos;&nbsp;&nbsp; <span style="color: #008000">//frequency, word length, part of speech </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">bool</span> isSuccess = <span style="color: #0000ff">true</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; FileStream fileStream = <span style="color: #0000ff">null</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; BinaryReader binReader = <span style="color: #0000ff">null</span>; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">try</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fileStream = <span style="color: #0000ff">new</span> FileStream(sFilename, FileMode.Open, FileAccess.Read); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (fileStream == <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> <span style="color: #0000ff">false</span>; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; binReader = <span style="color: #0000ff">new</span> BinaryReader(fileStream, Encoding.GetEncoding(<span style="color: #ff00ff">"gb2312"</span>)); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable = <span style="color: #0000ff">new</span> IndexTableItem[Predefine.CC_NUM]; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bReleased = <span style="color: #0000ff">false</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> i = 0; i &lt; Predefine.CC_NUM; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//read how many words start with this character </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i] = <span style="color: #0000ff">new</span> IndexTableItem(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i].nCount = binReader.ReadInt32(); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (indexTable[i].nCount &lt;= 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">continue</span>; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i].WordItems = <span style="color: #0000ff">new</span> WordItem[indexTable[i].nCount]; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> j = 0; j &lt; indexTable[i].nCount; j++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i].WordItems[j] = <span style="color: #0000ff">new</span> WordItem(); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frequency = binReader.ReadInt32();&nbsp;&nbsp; <span style="color: #008000">//read the frequency </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wordLength = binReader.ReadInt32();&nbsp; <span style="color: #008000">//read the word length </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pos = binReader.ReadInt32();&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//read the part of speech </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (wordLength &gt; 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i].WordItems[j].sWord = Utility.ByteArray2String(binReader.ReadBytes(wordLength)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i].WordItems[j].sWord = <span style="color: #ff00ff">""</span>; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Reset the frequency </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (bReset) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i].WordItems[j].nFrequency = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i].WordItems[j].nFrequency = frequency; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i].WordItems[j].nWordLen = wordLength; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indexTable[i].WordItems[j].nPOS = pos; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">catch</span> (<span style="color: #808000">Exception</span> e) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Message); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; isSuccess = <span style="color: #0000ff">false</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">finally</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (binReader != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; binReader.Close(); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (fileStream != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fileStream.Close(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> isSuccess; <br />
&nbsp;&nbsp; }&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp; <span style="color: #008000">//...... </span><br />
} <br />
<br />
</div>
</div>
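<p>The record layout that Load walks through (a count per cell, then frequency, length, POS, and the raw word bytes for each entry) can be exercised without the real .dct files. The sketch below writes a tiny in-memory dictionary in that layout and reads it back. DemoWordItem and DictionaryLayoutDemo are hypothetical names, the number of cells is a parameter where the real code uses Predefine.CC_NUM, and the encoding is passed in so the demo can use UTF-8 where the real files use gb2312.</p>

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical record type for illustration only.
public class DemoWordItem
{
    public int Frequency, WordLength, Pos;
    public string Word;
}

public static class DictionaryLayoutDemo
{
    // Per cell: int32 count; per entry: int32 frequency, int32 byte length, int32 POS, raw bytes.
    public static byte[] Write(DemoWordItem[][] cells, Encoding encoding)
    {
        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter writer = new BinaryWriter(stream))
        {
            foreach (DemoWordItem[] cell in cells)
            {
                writer.Write(cell.Length);
                foreach (DemoWordItem item in cell)
                {
                    byte[] bytes = encoding.GetBytes(item.Word);
                    writer.Write(item.Frequency);
                    writer.Write(bytes.Length);
                    writer.Write(item.Pos);
                    writer.Write(bytes);
                }
            }
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Mirrors the loop structure of Load, reading from memory rather than from a file.
    public static DemoWordItem[][] Read(byte[] data, int cellCount, Encoding encoding)
    {
        DemoWordItem[][] cells = new DemoWordItem[cellCount][];
        using (BinaryReader reader = new BinaryReader(new MemoryStream(data)))
        {
            for (int i = 0; i < cellCount; i++)
            {
                int count = reader.ReadInt32();
                cells[i] = new DemoWordItem[count];
                for (int j = 0; j < count; j++)
                {
                    DemoWordItem item = new DemoWordItem();
                    item.Frequency = reader.ReadInt32();
                    item.WordLength = reader.ReadInt32();
                    item.Pos = reader.ReadInt32();
                    item.Word = item.WordLength > 0
                        ? encoding.GetString(reader.ReadBytes(item.WordLength))
                        : "";
                    cells[i][j] = item;
                }
            }
        }
        return cells;
    }
}
```

<p>Because the length field counts bytes and not characters, the same word has a different stored length under different encodings. The demo is encoding-agnostic for that reason; on runtimes where gb2312 is not available by default it can be run with Encoding.UTF8.</p>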
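<p>The excerpt below shows the dictionary cells with CCID 2, 3, 4, and 5. There are 6768 CCID values, one per Chinese character, and every word that begins with a given character is recorded in that character's cell. The stored words omit their first character (I have added it back in parentheses), because it is always the character the cell belongs to. For each word the dictionary records its length, frequency, part of speech, and the word itself.</p>
<p>Note in particular that <strong><font color="#0000ff">within a cell, the words are sorted by CCID</font></strong>! This is crucial for the analysis that follows.</p>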
<div class="code">
<div class="title">
<div style="clear: none">Excerpt from the ICTCLAS dictionary</div>
</div>
<div class="content">Character: 埃, ID: 2 <br />
<br />
&nbsp; Len&nbsp; Freq&nbsp;&nbsp; POS&nbsp; Word <br />
&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp; 128&nbsp;&nbsp;&nbsp; h&nbsp;&nbsp; (埃) <br />
&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; j&nbsp;&nbsp; (埃) <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp; n&nbsp;&nbsp; (埃)镑 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; 28&nbsp;&nbsp;&nbsp; ns&nbsp; (埃)镑 <br />
&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp; n&nbsp;&nbsp; (埃)菲尔 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp; 511&nbsp;&nbsp;&nbsp; ns&nbsp; (埃)及 <br />
&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp; ns&nbsp; (埃)克森 <br />
&nbsp;&nbsp;&nbsp; 6&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; ns&nbsp; (埃)拉特湾 <br />
&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp; nr&nbsp; (埃)里温 <br />
&nbsp;&nbsp;&nbsp; 6&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; nz&nbsp; (埃)默鲁市 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; 27&nbsp;&nbsp;&nbsp; n&nbsp;&nbsp; (埃)塞 <br />
&nbsp;&nbsp;&nbsp; 8&nbsp;&nbsp;&nbsp; 64&nbsp;&nbsp;&nbsp; ns&nbsp; (埃)塞俄比亚 <br />
&nbsp;&nbsp; 22&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; ns&nbsp; (埃)塞俄比亚联邦民主共和国 <br />
&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp; 3&nbsp;&nbsp;&nbsp; ns&nbsp; (埃)塞萨 <br />
&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp; ns&nbsp; (埃)舍德 <br />
&nbsp;&nbsp;&nbsp; 6&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; nr&nbsp; (埃)斯特角 <br />
&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; ns&nbsp; (埃)松省 <br />
&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp; 3&nbsp;&nbsp;&nbsp; nr&nbsp; (埃)特纳 <br />
&nbsp;&nbsp;&nbsp; 6&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; nz&nbsp; (埃)因霍温 <br />
==================================== <br />
Character: 挨, ID: 3 <br />
<br />
&nbsp; Len&nbsp; Freq&nbsp;&nbsp; POS&nbsp; Word <br />
&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; 56&nbsp;&nbsp;&nbsp; h&nbsp;&nbsp; (挨) <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp; j&nbsp;&nbsp; (挨)次 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; 19&nbsp;&nbsp;&nbsp; n&nbsp;&nbsp; (挨)打 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 3&nbsp;&nbsp;&nbsp; ns&nbsp; (挨)冻 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp; n&nbsp;&nbsp; (挨)斗 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 9&nbsp;&nbsp;&nbsp; ns&nbsp; (挨)饿 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp; ns&nbsp; (挨)个 <br />
&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; ns&nbsp; (挨)个儿 <br />
&nbsp;&nbsp;&nbsp; 6&nbsp;&nbsp;&nbsp; 17&nbsp;&nbsp;&nbsp; nr&nbsp; (挨)家挨户 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp; nz&nbsp; (挨)近 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; n&nbsp;&nbsp; (挨)骂 <br />
&nbsp;&nbsp;&nbsp; 6&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp; ns&nbsp; (挨)门挨户 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp; ns&nbsp; (挨)批 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; ns&nbsp; (挨)整 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; 12&nbsp;&nbsp;&nbsp; ns&nbsp; (挨)着 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; nr&nbsp; (挨)揍 <br />
==================================== <br />
Character: 哎, ID: 4 <br />
<br />
&nbsp; Len&nbsp; Freq&nbsp;&nbsp; POS&nbsp; Word <br />
&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; 10&nbsp;&nbsp;&nbsp; h&nbsp;&nbsp; (哎) <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 3&nbsp;&nbsp;&nbsp; j&nbsp;&nbsp; (哎)呀 <br />
&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp; n&nbsp;&nbsp; (哎)哟 <br />
==================================== <br />
Character: 唉, ID: 5 <br />
<br />
&nbsp; Len&nbsp; Freq&nbsp;&nbsp; POS&nbsp; Word <br />
&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp; 9&nbsp;&nbsp;&nbsp; h&nbsp;&nbsp; (唉) <br />
&nbsp;&nbsp;&nbsp; 6&nbsp;&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp; j&nbsp;&nbsp; (唉)声叹气</div>
</div>
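<p>The cell IDs in the listing above are consistent with a simple mapping from GB2312 byte pairs. The sketch below is an assumption and is not taken from the article: it supposes that a character's ID is its offset from the first GB2312 Chinese character (lead byte 0xB0, trail byte 0xA1), with 94 trail-byte positions per lead byte. CcIdDemo is a hypothetical name.</p>

```csharp
using System;

public static class CcIdDemo
{
    // Assumed mapping (not shown in the article): offset from lead byte 0xB0 and
    // trail byte 0xA1, with 94 trail-byte positions for each lead byte.
    public static int FromGb2312Bytes(byte lead, byte trail)
    {
        return (lead - 0xB0) * 94 + (trail - 0xA1);
    }
}
```

<p>Under this assumption, 埃 (GB2312 bytes 0xB0 0xA3) maps to 2 and 唉 (0xB0 0xA6) maps to 5, matching the IDs shown in the excerpt. Lead bytes 0xB0 through 0xF7 give 72 &#215; 94 = 6768 cells, which agrees with the number of cells the dictionary uses.</p>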
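<p>Note also that a word may have more than one part of speech, so the same word can appear in the dictionary several times with different POS values. To identify a dictionary entry uniquely, you must specify both the word and its part of speech.</p>
<p>The other operation used heavily in WordDictionary is word lookup, implemented by the FindInOriginalTable method. In the original ICTCLAS code this method serves several lookup needs at once, which makes its structure fairly complicated. In SharpICTCLAS I overloaded it, with a separate FindInOriginalTable for each lookup purpose, which simplifies both the interface and the code. One overload is shown below; it checks whether a word with a given part of speech exists.</p>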
<div class="code">
<div class="title">
<div style="clear: none">One overload of the FindInOriginalTable method</div>
</div>
<div class="content"><span style="color: #0000ff">private</span> <span style="color: #0000ff">bool</span> FindInOriginalTable(<span style="color: #0000ff">int</span> nInnerCode, <span style="color: #0000ff">string</span> sWord, <span style="color: #0000ff">int</span> nPOS) <br />
{ <br />
&nbsp;&nbsp; WordItem[] pItems = indexTable[nInnerCode].WordItems; <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> nStart = 0, nEnd = indexTable[nInnerCode].nCount - 1; <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> nMid = (nStart + nEnd) / 2, nCmpValue; <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//Binary search </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (nStart &lt;= nEnd) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCmpValue = Utility.CCStringCompare(pItems[nMid].sWord, sWord); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nCmpValue == 0 &amp;&amp; (pItems[nMid].nPOS == nPOS || nPOS == -1)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> <span style="color: #0000ff">true</span>;<span style="color: #008000">//find it </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <span style="color: #0000ff">if</span> (nCmpValue &lt; 0 || (nCmpValue == 0 &amp;&amp; pItems[nMid].nPOS &lt; nPOS &amp;&amp; nPOS != -1)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nStart = nMid + 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <span style="color: #0000ff">if</span> (nCmpValue &gt; 0 || (nCmpValue == 0 &amp;&amp; pItems[nMid].nPOS &gt; nPOS &amp;&amp; nPOS != -1)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nEnd = nMid - 1; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nMid = (nStart + nEnd) / 2; <br />
&nbsp;&nbsp; } <br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> <span style="color: #0000ff">false</span>; <br />
} <br />
</div>
</div>
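<p>The lookup above is a binary search over a composite key: word first, POS second, with -1 acting as a wildcard for the POS. A stand-alone sketch of the same idea follows. It swaps Utility.CCStringCompare for string.CompareOrdinal so that it runs on its own, which means the sort order differs from the real dictionary; Entry and CellSearch are hypothetical names.</p>

```csharp
using System;

// Hypothetical entry type for illustration only.
public class Entry
{
    public string Word;
    public int Pos;
    public Entry(string word, int pos) { Word = word; Pos = pos; }
}

public static class CellSearch
{
    // Returns the index of an entry matching (word, pos), or -1 if there is none.
    // Passing pos == -1 matches any entry with that word.
    // Assumes `items` is sorted by word (ordinal order), then by POS.
    public static int Find(Entry[] items, string word, int pos)
    {
        int start = 0, end = items.Length - 1;
        while (start <= end)
        {
            int mid = start + (end - start) / 2;

            // Compare by word first; break ties on POS unless POS is the wildcard.
            int cmp = string.CompareOrdinal(items[mid].Word, word);
            if (cmp == 0 && pos != -1)
                cmp = items[mid].Pos.CompareTo(pos);

            if (cmp == 0) return mid;
            if (cmp < 0) start = mid + 1;
            else end = mid - 1;
        }
        return -1;
    }
}
```

<p>Folding the POS into the comparison result keeps the loop body down to a single three-way branch. With the wildcard, the search returns whichever matching entry it reaches first, which is sufficient for an existence check.</p>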
<p>The remaining functionality is not covered here.</p>
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
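<p>1. The WordDictionary class implements reading, writing, modifying, and searching the dictionary.</p>
<p>2. The dictionary records, for each of the 6768 Chinese characters, the words that begin with it together with their part of speech and frequency; this structure is worth understanding in detail.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>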
<img src ="http://www.blogjava.net/jiangyz/aggbug/171291.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 19:21 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171291.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Porting ICTCLAS to the C# Platform (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171290.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 11:20:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171290.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171290.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171290.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171290.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171290.html</trackback:ping><description><![CDATA[<p class="postText">After spending some time studying the code of the ICTCLAS segmentation system (Free edition) developed by 张华平 (Zhang Huaping) and 刘群 (Liu Qun) of the Institute of Computing Technology, Chinese Academy of Sciences, and reading a large amount of related material, I set about porting the C++ ICTCLAS system to the .NET platform, with good experimental results. The port was not easy: I had to study ICTCLAS segmentation theory while reading the C++ implementation, ran into a good deal of confusion along the way, and had to rewrite a small part of the code. I will describe the implementation in later posts.</p>
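<p class="postText">Everything except the final part-of-speech tagging, which is not yet finished, is nearly complete (including initial segmentation, N-shortest paths, person-name and place-name recognition, and final optimization). Let us first look at how the program segments the following sentence:</p>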
<div class="code">
<div class="title">
<div style="clear: none">Segmentation output of the SharpICTCLAS program</div>
</div>
<div class="content">==== Original sentence: <br />
<br />
王晓平在滦南大会上说的确实在理 <br />
<br />
==== Rough segmentation results (N results): <br />
<br />
<span style="color: #008000">始##始, 王, 晓, 平, 在, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末, </span><br />
<br />
<span style="color: #008000">始##始, 王, 晓, 平, 在, 滦, 南, 大会, 上, 说, 的, 确实, 在, 理, 末##末, </span><br />
<br />
<span style="color: #008000">始##始, 王, 晓, 平, 在, 滦, 南, 大, 会上, 说, 的, 确实, 在理, 末##末, </span><br />
<br />
<span style="color: #008000">始##始, 王, 晓, 平, 在, 滦, 南, 大会, 上, 说, 的, 确实, 在理, 末##末, </span><br />
<br />
<span style="color: #008000">始##始, 王, 晓, 平, 在, 滦, 南, 大, 会, 上, 说, 的, 确实, 在, 理, 末##末, </span><br />
<br />
==== After adding recognition of person names, transliterated names, and place names: <br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 218.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.86,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.27,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 271.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.00,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 20.37,&nbsp;&nbsp; nPOS: -28275,&nbsp;&nbsp; sWord:未##地 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 813.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp; 14536.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp; 1333.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp; 6136.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会 <br />
row:&nbsp; 8,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 469.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp; 23706.00,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:上 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row: 11,&nbsp; col: 12,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 12,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row: 14,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row: 15,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row: 16,&nbsp; col: 17,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
<br />
==== Final recognition result: <br />
<br />
<span style="color: #008000">始##始, 王晓平, 在, 滦南, 大会, 上, 说, 的, 确实, 在, 理, 末##末, </span><br />
<br />
--------------------------------------------------- <br />
<br />
==== Original sentence: <br />
<br />
馆内陈列周恩来和邓颖超生前使用过的物品 <br />
<br />
==== Coarse segmentation results (N results): <br />
<br />
<span style="color: #008000">始##始, 馆内, 陈列, 周恩来, 和, 邓, 颖, 超, 生前, 使用, 过, 的, 物品, 末##末, </span><br />
<br />
<span style="color: #008000">始##始, 馆内, 陈列, 周恩来, 和, 邓, 颖, 超生, 前, 使用, 过, 的, 物品, 末##末, </span><br />
<br />
<span style="color: #008000">始##始, 馆内, 陈列, 周恩来, 和, 邓, 颖, 超, 生前, 使用, 过, 的, 物, 品, 末##末, </span><br />
<br />
<span style="color: #008000">始##始, 馆内, 陈列, 周恩来, 和, 邓, 颖, 超生, 前, 使, 用, 过, 的, 物品, 末##末, </span><br />
<br />
<span style="color: #008000">始##始, 馆内, 陈列, 周恩来, 和, 邓, 颖, 超, 生, 前, 使用, 过, 的, 物品, 末##末, </span><br />
<br />
==== After adding recognition of personal names, transliterated names and place names: <br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 24.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:馆内 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 70.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:陈列 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 1990.00,&nbsp;&nbsp; nPOS:&nbsp; 28274,&nbsp;&nbsp; sWord:周恩来 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp; 72562.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:和 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 90.00,&nbsp;&nbsp; nPOS:&nbsp; 28274,&nbsp;&nbsp; sWord:邓 <br />
row:&nbsp; 9,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 15.93,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.00,&nbsp;&nbsp; nPOS:&nbsp; 28274,&nbsp;&nbsp; sWord:颖 <br />
row: 11,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 200.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:超 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:超生 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 532.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:生 <br />
row: 12,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 175.00,&nbsp;&nbsp; nPOS:&nbsp; 29696,&nbsp;&nbsp; sWord:生前 <br />
row: 13,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp; 5107.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:前 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp; 8224.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:使 <br />
row: 14,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp; 1876.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:使用 <br />
row: 15,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp; 5300.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:用 <br />
row: 16,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp; 5090.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:过 <br />
row: 17,&nbsp; col: 18,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 18,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 200.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:物 <br />
row: 18,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 189.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:物品 <br />
row: 19,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 75.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:品 <br />
row: 20,&nbsp; col: 21,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
<br />
==== Final recognition result: <br />
<br />
<span style="color: #008000">始##始, 馆内, 陈列, 周恩来, 和, 邓颖超, 生前, 使用, 过, 的, 物品, 末##末, </span></div>
</div>
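<p class="postText">As the results show, the segmentation quality is satisfactory (the credit, of course, goes entirely to the sound theoretical design of the original ICTCLAS).</p>
<p class="postText">The more prominent problems encountered during the port include:</p>
<h3 class="postText">1. C# is Unicode-based, whereas the original design assumes single-byte characters</h3>
<p class="postText">The original design makes heavy use of char arrays, in which one Chinese character occupies two positions. To extract a single character, the code has to check whether it is a half-width (single-byte) character or a full-width Chinese character, so code like the following appears everywhere:</p>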
<div class="code">
<div class="title">
<div style="clear: none">C++ code for extracting one character</div>
</div>
<div class="content">char tchar[3];<br />
tchar[2] = 0;<br />
<br />
tchar[0] = sWord[k]; <br />
tchar[1] = 0; <br />
<span style="color: #0000ff">if</span> (sWord[k] &lt; 0) <br />
{ <br />
&nbsp; tchar[1] = sWord[k + 1]; <br />
&nbsp; k += 1; <br />
} <br />
k += 1;</div>
</div>
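<p class="postText">Tests such as &#8220;if (sWord[k] &lt; 0)&#8221; are used to decide whether a byte begins a Chinese character (the lead byte of a double-byte character has its high bit set, so it is negative when char is signed), which is extremely cumbersome.</p>
<p class="postText">C# itself has good Unicode support, so string.ToCharArray() is all that is needed to split a string into individual characters. Note, however, that a Chinese character has length 1 in C# but length 2 in the C++ implementation, so offsets and lengths must be handled carefully during the port.</p>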
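<p class="postText">As a rough illustration of the difference, the byte-oriented extraction could be wrapped in one helper. This is a minimal C++ sketch, not code from ICTCLAS or SharpICTCLAS: the name SplitGbChars is hypothetical, and it assumes GB2312/GBK input in which any byte with the high bit set begins a two-byte character.</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of ICTCLAS): split a GB2312/GBK-encoded
// byte string into one std::string per character.
// Assumption: a byte with the high bit set starts a two-byte character.
std::vector<std::string> SplitGbChars(const std::string& text) {
    std::vector<std::string> chars;
    std::size_t k = 0;
    while (k < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[k]);
        // Take two bytes for a double-byte character (when a trail byte
        // is available), otherwise a single byte.
        const std::size_t len = (lead >= 0x80 && k + 1 < text.size()) ? 2 : 1;
        chars.push_back(text.substr(k, len));
        k += len;
    }
    return chars;
}
```

<p class="postText">With a helper of this kind the index arithmetic lives in one place; in C#, string.ToCharArray() gives the same per-character view directly.</p>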
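<h3 class="postText">2. Regular expressions simplify part of the design</h3>
<p class="postText">In the original design, checking whether a string is a number takes a lot of code (for example, the IsAllNum method in the Utility class runs to some 70 to 80 lines); with a regular expression, one line does the job. The ported code uses many regular expressions to simplify similar code.</p>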
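<p class="postText">To give a feel for the saving, here is a minimal sketch of a regex-based number check. It is written in C++ with std::regex rather than C#, the name IsPlainNumber and the pattern are assumptions made for illustration, and it only accepts ASCII digits with an optional decimal part; it does not try to reproduce everything the original IsAllNum accepts.</p>

```cpp
#include <regex>
#include <string>

// Hypothetical one-line replacement for a hand-written digit scan.
// Accepts "2007" or "3.14"; rejects empty strings and anything else.
bool IsPlainNumber(const std::string& s) {
    static const std::regex pattern(R"([0-9]+(\.[0-9]+)?)");
    return std::regex_match(s, pattern);  // the whole string must match
}
```

<p class="postText">The trade-off is the usual one: the pattern is far shorter than a manual scan, but it should be checked against the inputs the old routine accepted before it replaces that routine.</p>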
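<h3 class="postText">3. String comparison</h3>
<p class="postText">In the original design, Chinese characters are ordered by their CCID (especially when searching the dictionary). The CCID of a Chinese character is computed as follows:</p>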
<div class="code">
<div class="title">
<div style="clear: none">Computing the CCID (C#)</div>
</div>
<div class="content"><span style="color: #008000">//==================================================================== </span><br />
<span style="color: #008000">// Return the CC_ID that corresponds to a Chinese character </span><br />
<span style="color: #008000">//==================================================================== </span><br />
<span style="color: #0000ff">public</span> <span style="color: #0000ff">static</span> <span style="color: #0000ff">int</span> CC_ID(<span style="color: #0000ff">char</span> c) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">byte</span>[] b = Encoding.GetEncoding(<span style="color: #ff00ff">"gb2312"</span>).GetBytes(c.ToString()); <br />
&nbsp;&nbsp; <span style="color: #0000ff">if</span> (b.Length != 2) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> -1; <br />
&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> (Convert.ToInt32(b[0]) - 176) * 94 + (Convert.ToInt32(b[1]) - 161); <br />
}<br />
</div>
</div>
<p class="postText">C# has no built-in string comparison that matches the CCID order. For example, in the original design the order of &#8220;声&#8221;, &#8220;生&#8221; and &#8220;现&#8221; is &#8220;声&#8221; &lt; &#8220;生&#8221; &lt; &#8220;现&#8221;, but string.Compare in C# cannot produce this order whether it is given StringComparison.Ordinal, StringComparison.CurrentCulture or StringComparison.InvariantCulture, so I had to write my own string comparison function.</p>
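<p class="postText">The CCID order can be sketched in a few lines. The C++ below is an illustration only, not the comparison function used in SharpICTCLAS: the names CcId and CompareByCcId are hypothetical, the input is assumed to be GB2312-encoded bytes, and the byte values quoted in the comments are, to the best of my knowledge, the GB2312 codes of the three characters from the example.</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: CC_ID of the two-byte GB2312 character that starts
// at s[pos], using the same formula as the C# method above.
// Returns -1 when there is no such character at that position.
int CcId(const std::string& s, std::size_t pos) {
    if (pos + 1 >= s.size()) return -1;
    const int hi = static_cast<unsigned char>(s[pos]);
    const int lo = static_cast<unsigned char>(s[pos + 1]);
    // Assumption: only the GB2312 hanzi area (0xB0A1..0xF7FE) is handled.
    if (hi < 0xB0 || hi > 0xF7 || lo < 0xA1 || lo > 0xFE) return -1;
    return (hi - 176) * 94 + (lo - 161);
}

// Three-way comparison of two single characters by CC_ID.
// Example bytes: 声 = C9 F9, 生 = C9 FA, 现 = CF D6, so 声 < 生 < 现.
int CompareByCcId(const std::string& a, const std::string& b) {
    const int ia = CcId(a, 0);
    const int ib = CcId(b, 0);
    return (ia < ib) ? -1 : (ia > ib) ? 1 : 0;
}
```

<p class="postText">In the C++ original this order falls out of comparing raw GB2312 bytes; in C# the strings are UTF-16, which is why the built-in comparisons give a different order and a CCID-based comparer is needed.</p>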
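<h3 class="postText">4. Parts of the code were rewritten</h3>
<p class="postText">Because the original ICTCLAS code is convoluted and hard to follow (see &#8220;<a href="http://www.cnblogs.com/zhenyulu/articles/653254.html">The Inscrutable ICTCLAS Segmentation Code (Part 1)</a>&#8221; and &#8220;<a href="http://www.cnblogs.com/zhenyulu/articles/657017.html">The Inscrutable ICTCLAS Segmentation Code (Part 2)</a>&#8221;), I rewrote parts of it, mainly:</p>
<p class="postText">1) Rewrote DynamicArray. The new code uses three classes in place of the original one and separates the different responsibilities, which makes the code simple and readable.</p>
<p class="postText">2) Rewrote NShortPath. To this day I have not fully understood the original author's approach to implementing NShortPath, so I simply wrote my own. The new code is considerably simpler than the original and easier to understand (at least I think so).</p>
<p class="postText">3) Rewrote the GenerateWord method in the Segment class to record results in a linked list rather than an array and to process them as a pipeline, which simplifies the subsequent merging logic.</p>
<p class="postText">4) Renamed some properties, variables and fields so that the names carry real meaning. For example, as far as I can tell nHandle and nPOS in the original code refer to the same thing, so the new program uses nPOS throughout.</p>
<h3 class="postText">5. A considerable part of the original code was kept</h3>
<p class="postText">Where the logic is exceptionally complex, the new code keeps the original design.</p>
<p class="postText">For example, the strategy in the Segment class for merging dates, years, times and so on nests if conditions up to five levels deep; to preserve the original logic, only minor adjustments were made during the port.</p>
<p class="postText">In addition, the code in classes such as CSpan and Unknown, which contains a large amount of computation logic, was left almost untouched and keeps its original flavor.</p>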
<p class="postText">　</p>
<p class="postText"><font color="#0000ff">In follow-up articles I will describe, over several installments, how SharpICTCLAS is implemented and where the original ICTCLAS code was changed.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</font></p>
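  <img src ="http://www.blogjava.net/jiangyz/aggbug/171290.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 19:20 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171290.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>The Inscrutable ICTCLAS Segmentation Code, Part 2 (repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171289.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 11:18:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171289.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171289.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171289.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171289.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171289.html</trackback:ping><description><![CDATA[<p>The previous article, &#8220;<a href="http://www.cnblogs.com/zhenyulu/articles/653254.html">The Inscrutable ICTCLAS Segmentation Code (Part 1)</a>&#8221;, argued that some of the code in the ICTCLAS segmentation system leaves readers at a loss, and that it takes considerable effort to work out what is going on. Many people agreed that code should be simple and clear, but some disagreed. Their objections fall mainly into three groups: (1) if the code is efficient, a little complexity is acceptable; (2) good comments are enough; (3) what matters in software is the idea (I agree with this much), and complaining about the code is like buying a computer and, regardless of what the machine inside the box is like, fussing over the crooked packing tape on the outside (this I firmly reject: a good idea alone does not make a good computer, which also needs stable, plug-and-play hardware; besides, inscrutable code is not merely crooked packing tape, it may well mean that the insulating tape inside the machine has failed...).</p>
<p>Over the past two days, while studying the ideas behind ICTCLAS, I have also been digesting its implementation. What I found goes beyond sacrificing clarity for efficiency: the code reads as if even its author no longer knew what he really wanted to do, even though the program produces correct results!</p>
<p>To show how serious this is, we need to start with CQueue.cpp. I have some reservations about the CQueue class: it is supposedly a queue, yet it exposes Push and Pop methods (which suggests a stack rather than a queue), and its Pop method is a real hodgepodge. None of this is a matter of principle, though; everyone has their own coding habits. What CQueue does is build a sorted queue (ordered by eWeight, ascending), as shown in Figure 1:</p>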
<p><img height="95" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0225007.gif" width="581" border="0" /></p>
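<p>(Figure 1)</p>
<p>With this in mind, let us look at the implementation in ICTCLAS's NShortPath.cpp (only the ShortPath method). To expose the problem more clearly, I have removed some unrelated parts of the code.</p>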
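<p>The behaviour described above can be sketched as follows. This is a minimal C++ illustration under stated assumptions, not the CQueue source: the names SortedQueue and QueueItem, the three fields, and the Pop signature are all hypothetical, chosen only to show a queue that keeps its elements in ascending eWeight order.</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element type: a predecessor node, an index, and a weight.
struct QueueItem {
    int parent;
    int index;
    double weight;
};

// Sketch of a queue kept sorted by ascending weight.
class SortedQueue {
public:
    // Insert so that weights stay ascending; equal weights keep
    // their insertion order.
    void Push(int parent, int index, double weight) {
        std::size_t pos = 0;
        while (pos < items_.size() && items_[pos].weight <= weight) ++pos;
        items_.insert(items_.begin() + pos, QueueItem{parent, index, weight});
    }

    // Remove the element with the smallest weight.
    // Returns false when the queue is empty.
    bool Pop(QueueItem* out) {
        if (items_.empty()) return false;
        *out = items_.front();
        items_.erase(items_.begin());
        return true;
    }

private:
    std::vector<QueueItem> items_;
};
```

<p>Whatever the methods are called, the property that matters for the discussion below is that successive Pop calls return weights in non-decreasing order.</p>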
<div class="code">
<div class="title">
<div style="clear: none">The ShortPath method from NShortPath.cpp</div>
</div>
<div class="content"><span style="color: #0000ff">int</span> CNShortPath::ShortPath() <br />
{ <br />
&nbsp; ...... <br />
&nbsp; <span style="color: #0000ff">for</span> (; nCurNode &lt; m_nVertex; nCurNode++) <br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; CQueue queWork; <br />
&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The code omitted here puts a set of nodes into the queue queWork </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//in ascending order of eWeight </span><br />
&nbsp;&nbsp;&nbsp; ...... <br />
<br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Initialize the weights </span><br />
&nbsp;&nbsp;&nbsp; <strong><span style="color: #0000ff">for</span> (i = 0; i &lt; m_nValueKind; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWeight[nCurNode - 1][i] = INFINITE_VALUE; </strong><br />
<br />
&nbsp;&nbsp;&nbsp; i = 0; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (i &lt; m_nValueKind &amp;&amp; queWork.Pop(&amp;nPreNode, &amp;nIndex, &amp;eWeight) !=&nbsp; -1) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Set the current node weight and parent </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (<font color="#ff0000">m_pWeight[nCurNode - 1][i] == INFINITE_VALUE</font>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWeight[nCurNode - 1][i] = eWeight; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <span style="color: #0000ff">if</span> (<font color="#ff0000">m_pWeight[nCurNode - 1][i] &lt; eWeight</font>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Next queue </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i++; <span style="color: #008000">//Go next queue and record next weight </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (i == m_nValueKind) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Get the last position </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWeight[nCurNode - 1][i] = eWeight; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pParent[nCurNode - 1][i].Push(nPreNode, nIndex); <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp; } <br />
&nbsp; ...... <br />
}</div>
</div>
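<p>What is the author trying to do in the code above? Let us analyze it:</p>
<p>The variable queWork holds a queue sorted by eWeight in ascending order. Suppose it contains 4 elements whose eWeight values are 5, 6, 7 and 8, and suppose m_nValueKind is 2, i.e. we are looking for the two shortest paths (<font color="#008000">note: this is not entirely accurate; the reason is explained later</font>). Under these assumptions, the program runs as follows:</p>
<p>1) All m_pWeight[nCurNode - 1][i] are initialized to INFINITE_VALUE.</p>
<p>2) In the first iteration we take the first element from queWork, whose eWeight is 5. Note that the test &#8220;if (m_pWeight[nCurNode - 1][i] == INFINITE_VALUE)&#8221; does no real work here: step 1 initialized every m_pWeight[nCurNode - 1][i] to INFINITE_VALUE, so in the first iteration the condition is necessarily true.</p>
<p>3) In the second iteration we take the second element from queWork, whose eWeight is 6. Here the test &#8220;else if (m_pWeight[nCurNode - 1][i] &lt; eWeight)&#8221; seems to do no real work either: queWork is sorted, so the second element's eWeight cannot be smaller than the first's. In our example the expression is necessarily true, so i++ is executed.</p>
<p>4) The program then re-enters the loop at step 2).</p>
<p>The result of the run is shown in Figure 2:</p>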
<p><img height="228" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0225008.gif" width="584" border="0" /></p>
<p>(Figure 2)</p>
<p>If that were all, the code above could apparently be simplified to:</p>
<div class="code">
<div class="title">
<div style="clear: none">Simplified program</div>
</div>
<div class="content"><span style="color: #0000ff">int</span> CNShortPath::ShortPath()&nbsp; <br />
{&nbsp; <br />
&nbsp; ......&nbsp; <br />
&nbsp; <span style="color: #0000ff">for</span> (; nCurNode &lt; m_nVertex; nCurNode++)&nbsp; <br />
&nbsp; {&nbsp; <br />
&nbsp;&nbsp;&nbsp; CQueue queWork;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The code omitted here puts a set of nodes into the queue queWork&nbsp; </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//in ascending order of eWeight&nbsp; </span><br />
&nbsp;&nbsp;&nbsp; ......&nbsp; <br />
<br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Initialize the weights&nbsp; </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span> (i = 0; i &lt; m_nValueKind; i++)&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWeight[nCurNode - 1][i] = INFINITE_VALUE;&nbsp; <br />
<br />
&nbsp;&nbsp;&nbsp; i = 0;&nbsp; <br />
<strong>&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (i &lt; m_nValueKind &amp;&amp; queWork.Pop(&amp;nPreNode, &amp;nIndex, &amp;eWeight) !=&nbsp; -1)&nbsp; <br />
&nbsp;&nbsp;&nbsp; {&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWeight[nCurNode - 1][i] = eWeight;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pParent[nCurNode - 1][i].Push(nPreNode, nIndex);&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i++; <br />
&nbsp;&nbsp;&nbsp; }&nbsp; </strong><br />
&nbsp; }&nbsp; <br />
&nbsp; ......&nbsp; <br />
}</div>
</div>
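<p>For this example the simplified program produces exactly the same result as the ICTCLAS program. But the author surely had a reason for writing such complicated code. Is something wrong with our analysis?</p>
<p>Yes! The author put the most important thing into the code as an <strong><font color="#ff0000">implicit condition</font></strong>. We can only infer what it is from the if and else if conditions, yet this implicit condition is precisely the crux of the code. <font color="#0000ff">When the most important point is not visible in the code and the reader has to infer it, I can only conclude that the author himself was not clear about what matters most and merely made sure the program ran without errors</font>.</p>
<p>So what is hidden? It is the condition &#8220;<strong><font color="#ff0000">m_pWeight[nCurNode - 1][i] == eWeight</font></strong>&#8221;. The ShortPath method has an if and an else if, but neither says what happens when the stored weight equals eWeight: in that case both tests fail and control falls through to the Push call with i unchanged. The author left this for the reader to puzzle out. Spot the implicit condition and you are fine; fail to spot it and you have only yourself to blame.</p>
<p>Let us try another data set: suppose queWork contains 4 elements whose eWeight values are 5, 6, 6 and 7, and m_nValueKind is again 2. What does ICTCLAS's ShortPath produce? Work it out from the code, then compare your answer with the result in Figure 3.</p>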
<p><img height="255" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0225009.gif" width="584" border="0" /></p>
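<p>(Figure 3)</p>
<p>Here m_pParent[nCurNode - 1][1] is a CQueue that holds both nodes whose eWeight is 6. This is why I said earlier that setting N to 2 in NShortPath does not mean there are only two paths.</p>
<p>If you have had the patience to read this far, what do you make of the code in NShortPath.cpp? Writing reasonably clear code is not hard, provided you really understand what matters most. With a few changes, this inscrutable code can be improved considerably. The adjusted code follows:</p>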
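<p>The grouping rule that the two worked examples reveal can be stated directly. The following C++ sketch is my own restatement, not code from ICTCLAS: the function name GroupSmallestWeights and its parameters are hypothetical, and it works on a plain sorted vector instead of CQueue. Elements with the same weight share one slot, and at most <code>kinds</code> distinct weights are kept.</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: given weights in ascending order, return up to
// `kinds` groups; groups[i] lists the positions of all elements that
// share the i-th smallest distinct weight.
std::vector<std::vector<std::size_t>> GroupSmallestWeights(
        const std::vector<double>& sortedWeights, std::size_t kinds) {
    std::vector<std::vector<std::size_t>> groups;
    for (std::size_t pos = 0; pos < sortedWeights.size(); ++pos) {
        const bool newWeight =
            (pos == 0) || (sortedWeights[pos] != sortedWeights[pos - 1]);
        if (newWeight) {
            if (groups.size() == kinds) break;  // all slots are in use
            groups.emplace_back();              // open the next slot
        }
        groups.back().push_back(pos);           // equal weight: same slot
    }
    return groups;
}
```

<p>For the weights 5, 6, 6, 7 with <code>kinds</code> set to 2, the first group holds the element with weight 5 and the second holds both elements with weight 6, which matches the walk-through above.</p>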
<div class="code">
<div class="title">
<div style="clear: none">The reworked code</div>
</div>
<div class="content"><span style="color: #0000ff">int</span> CNShortPath::ShortPath()&nbsp; <br />
{&nbsp; <br />
&nbsp; ......&nbsp; <br />
&nbsp; <span style="color: #0000ff">for</span> (; nCurNode &lt; m_nVertex; nCurNode++)&nbsp; <br />
&nbsp; {&nbsp; <br />
&nbsp;&nbsp;&nbsp; CQueue queWork;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The code omitted here puts a set of nodes into the queue queWork&nbsp; </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//in ascending order of eWeight&nbsp; </span><br />
&nbsp;&nbsp;&nbsp; ......&nbsp; <br />
<br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Initialize the weights&nbsp; </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span> (i = 0; i &lt; m_nValueKind; i++)&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWeight[nCurNode - 1][i] = INFINITE_VALUE;&nbsp; <br />
<br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span>(queWork.Pop(&amp;nPreNode, &amp;nIndex, &amp;eWeight) != -1) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span>(i=0; i &lt; m_nValueKind ; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWeight[nCurNode - 1][i] = eWeight;&nbsp; <br />
<strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">do</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pParent[nCurNode - 1][i].Push(nPreNode, nIndex);&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span>(queWork.Pop(&amp;nPreNode, &amp;nIndex, &amp;new_eWeight) == -1) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">goto</span> <font color="#ff0000">finish</font>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<span style="color: #0000ff">while</span>(<font color="#ff0000">new_eWeight == eWeight</font>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
</strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; eWeight = new_eWeight; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <strong><font color="#ff0000">finish: </font></strong>; <span style="color: #008000">//queWork is exhausted: move on to the next node </span><br />
&nbsp; }&nbsp; <br />
&nbsp; ......&nbsp; <br />
}</div>
</div>
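<p>The reworked code uses a do...while loop, plus a goto that leaves the nested loops as soon as the queue is empty, to simplify the structure. I think code like this reads much more clearly.</p>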
<ul>
    <li><strong><font color="#800080">Summary</font></strong> </li>
</ul>
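<p>(1) What matters in software is the idea, and only someone who truly understands the idea can write clear code. Unclear code means the idea itself is not clear.</p>
<p>(2) Good comments are no substitute for a clear code structure.</p>
<p>(3) Unless you have measured it, do not sacrifice readability for a small gain in efficiency.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>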
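<img src ="http://www.blogjava.net/jiangyz/aggbug/171289.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 19:18 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171289.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>The Inscrutable ICTCLAS Segmentation Code, Part 1 (repost) </title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171272.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 10:50:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171272.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171272.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171272.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171272.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171272.html</trackback:ping><description><![CDATA[<p>ICTCLAS is a widely praised word segmentation system developed by Zhang Huaping (张华平) and Liu Qun (刘群) of the Institute of Computing Technology, Chinese Academy of Sciences. Its Free edition is open source and provides valuable study material for beginners. The C++ code, FreeICTCLASLinux.tar, can be found at &#8220;<a href="http://sewm.pku.edu.cn/QA/">http://sewm.pku.edu.cn/QA/</a>&#8221;.</p>
<p>However, this version of ICTCLAS comes without thorough documentation, which makes it fairly hard to read. Fortunately, some articles analyzing the ICTCLAS code can be found online, and they help a great deal in understanding how the segmenter works internally. They include:</p>
<p>1) <a href="http://blog.csdn.net/group/ictclas4j/">http://blog.csdn.net/group/ictclas4j/</a>: 《ICTCLAS分词系统研究（一）～（六）》 (Studies of the ICTCLAS Segmentation System, Parts 1 to 6), by sinboy.</p>
<p>2) <a href="http://qxred.yculblog.com/post.1204714.html">http://qxred.yculblog.com/post.1204714.html</a>: 《ICTCLAS 中科院分词系统 代码 注释 中文分词 词性标注》 (ICTCLAS, the CAS segmentation system: code, annotations, Chinese word segmentation, part-of-speech tagging), by 风暴红QxRed.</p>
<p>Reading the ICTCLAS code with these articles as a guide makes it fairly easy to follow. Yet the more I read, the more tired I grew of its inscrutable code. I can only admire the people at the Institute of Computing Technology for their rigorous thinking and clear heads: they can write watertight code that leaves &#8220;simple-minded&#8221; people baffled, and turn something inherently simple into something endlessly complicated...</p>
<p>ICTCLAS has a class named CDynamicArray, stored in the two files DynamicArray.cpp and DynamicArray.h. What is this DynamicArray for? After some study I finally understood that it is a sorted linked list. To make this clearer, consider the following figure:</p>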
<p><img alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220001.gif" border="0" /></p>
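<p>(Figure 1)</p>
<p>The figure above shows a linked list sorted by index; inserting a new node must keep the index values in order. The DynamicArray class does essentially the same thing, except that the sort key is not index but the pair row and col, as shown below:</p>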
<p><img alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220002.gif" border="0" /></p>
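<p>(Figure 2)</p>
<p>As you can see, this sorted list is ordered first by row, and by col among nodes with the same row. The ordering can of course be changed: if we sort by col first and then by row, the list above has to be written as:</p>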
<p><img alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220003.gif" border="0" /></p>
<p>(Figure 3)</p>
<p>With this in mind, let us look at the implementation in ICTCLAS's DynamicArray.cpp (only the GetElement method, whose basic job is to take a row and a col and return the corresponding element).</p>
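<p>The two orderings can be sketched with an ordinary singly linked list. The C++ below is only an illustration of the idea, not the CDynamicArray implementation: the names Cell, Precedes and InsertSorted are hypothetical, std::forward_list stands in for the hand-written chain, and duplicate (row, col) pairs are simply inserted rather than merged.</p>

```cpp
#include <forward_list>

// Hypothetical node payload: a position in the table plus a value.
struct Cell {
    int row;
    int col;
    double value;
};

// True when a must come before b under the chosen ordering:
// row-first (as in Figure 2) or col-first (as in Figure 3).
bool Precedes(const Cell& a, const Cell& b, bool rowFirst) {
    if (rowFirst) return a.row != b.row ? a.row < b.row : a.col < b.col;
    return a.col != b.col ? a.col < b.col : a.row < b.row;
}

// Insert a cell so that the list stays sorted.
void InsertSorted(std::forward_list<Cell>& cells, const Cell& cell, bool rowFirst) {
    auto prev = cells.before_begin();
    for (auto cur = cells.begin();
         cur != cells.end() && Precedes(*cur, cell, rowFirst); ++cur) {
        ++prev;  // prev trails one node behind cur
    }
    cells.insert_after(prev, cell);
}
```

<p>Keeping the ordering rule in one small predicate is one way to avoid repeating the long compound conditions that appear in the original GetElement.</p>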
<div class="code">
<div class="title">
<div style="clear: none">DynamicArray.cpp</div>
</div>
<div class="content">ELEMENT_TYPE CDynamicArray::GetElement(<span style="color: #0000ff">int</span> nRow, <span style="color: #0000ff">int</span> nCol, PARRAY_CHAIN pStart, <br />
&nbsp; PARRAY_CHAIN *pRet) <br />
{ <br />
&nbsp; PARRAY_CHAIN pCur = pStart; <br />
&nbsp; <span style="color: #0000ff">if</span> (pStart == 0) <br />
&nbsp;&nbsp;&nbsp; pCur = m_pHead; <br />
&nbsp; <span style="color: #0000ff">if</span> (pRet != 0) <br />
&nbsp;&nbsp;&nbsp; *pRet = NULL; <br />
&nbsp; <span style="color: #0000ff">if</span> (nRow &gt; (<span style="color: #0000ff">int</span>)m_nRow || nCol &gt; (<span style="color: #0000ff">int</span>)m_nCol) <br />
&nbsp; <span style="color: #008000">//Judge if the row and col is overflow </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> INFINITE_VALUE; <br />
&nbsp; <span style="color: #0000ff">if</span> (<strong>m_bRowFirst</strong>) <br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (<strong>pCur != NULL &amp;&amp; (nRow !=&nbsp; - 1 &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;row &lt; nRow || (nCol !=&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - 1 &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;row == nRow &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;col &lt; nCol))</strong>) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur-&gt;next; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp; } <br />
&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (<strong>pCur != NULL &amp;&amp; (nCol !=&nbsp; - 1 &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;col &lt; nCol || ((<span style="color: #0000ff">int</span>)pCur <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -&gt;col == nCol &amp;&amp; nRow !=&nbsp; - 1 &amp;&amp; (<span style="color: #0000ff">int</span>)pCur-&gt;row &lt; nRow))</strong>) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur-&gt;next; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp; } <br />
&nbsp; <span style="color: #0000ff">if</span> (<strong>pCur != NULL &amp;&amp; ((<span style="color: #0000ff">int</span>)pCur-&gt;row == nRow || nRow ==&nbsp; - 1) &amp;&amp; ((<span style="color: #0000ff">int</span>)pCur <br />
&nbsp;&nbsp;&nbsp; -&gt;col == nCol || nCol ==&nbsp; - 1)</strong>) <br />
&nbsp; <span style="color: #008000">//Find the same position </span><br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Find it and return the value </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> pCur-&gt;<span style="color: #0000ff">value</span>; <br />
&nbsp; } <br />
&nbsp; <span style="color: #0000ff">return</span> INFINITE_VALUE; <br />
}</div>
</div>
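<p>First, a word about the m_bRowFirst variable: it indicates whether the list is sorted by row first or by col first. If m_bRowFirst is true, the list looks like Figure 2 above; if it is false, the list looks like Figure 3.</p>
<p>Beyond that, the long conditional expressions above must look frightening! Even more frightening is the code that calls this method:</p>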
<div class="code">
<div class="title">
<div style="clear: none">Calls to the GetElement method</div>
</div>
<div class="content"><span style="color: #008000"><br />
//From the ShortPath method in NShortPath.cpp </span><br />
eWeight = m_apCost-&gt;GetElement( -1, nCurNode, 0, &amp;pEdgeList); <br />
&nbsp;<br />
<span style="color: #008000">//From the BiGraphGenerate method in Segment.cpp </span><br />
aWord.GetElement(pCur-&gt;col, -1, pCur, &amp;pNextWords);<span style="color: #008000">//Get next words which begin with pCur-&gt;col<br />
　</span></div>
</div>
<ul>
    <li><strong><font color="#800080">Analyzing the first call</font></strong> </li>
</ul>
<p>The first call passes -1 as the nRow argument of GetElement. What is it trying to do?</p>
<p>Suppose m_bRowFirst is true at this point and the nCol passed in is not -1. Then <code>while (pCur != NULL &amp;&amp; (nRow != -1 &amp;&amp; (int)pCur-&gt;row &lt; nRow || (nCol != -1 &amp;&amp; (int)pCur-&gt;row == nRow &amp;&amp; (int)pCur-&gt;col &lt; nCol)))</code> is equivalent to <code>while (pCur != NULL &amp;&amp; (<font color="#ff0000">(int)pCur-&gt;row == -1</font> &amp;&amp; (int)pCur-&gt;col &lt; nCol))</code>. Note that the part in red is always false at run time (no node has a row of -1), so the expression is equivalent to <code>while(false)</code>, which makes the loop meaningless!</p>
<p>We can therefore conclude: <strong><font color="#0000ff">passing -1 as the nRow argument of GetElement makes sense if and only if m_bRowFirst is false</font></strong>. In that case the second while loop in the code is executed. Let us analyze it:</p>
<p>When nRow is -1 (and nCol is not), <code>while (pCur != NULL &amp;&amp; (nCol != -1 &amp;&amp; (int)pCur-&gt;col &lt; nCol || ((int)pCur-&gt;col == nCol &amp;&amp; nRow != -1 &amp;&amp; (int)pCur-&gt;row &lt; nRow)))</code> is equivalent to <code>while (pCur != NULL &amp;&amp; ((int)pCur-&gt;col &lt; nCol))</code>, which is much easier to explain: find the first node with col = nCol in a list like the one in Figure 3.</p>
<p>My God!</p>
<ul>
    <li><font color="#800080"><strong>Analyzing the second call</strong></font> </li>
</ul>
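<p>The second call is even more baffling: it passes pCur-&gt;col as GetElement's nRow argument and -1 as nCol. What is that for? Working this out takes the better part of an hour (once again, hats off to the experts at the Chinese Academy of Sciences).</p>
<p>By the same reasoning as in &#8220;Analyzing the first call&#8221;, <strong><font color="#0000ff">passing -1 as the nCol argument of GetElement makes sense if and only if m_bRowFirst is true</font></strong>. The list must therefore be sorted by row first (as in Figure 2), and the call to DynamicArray's GetElement can be reduced to:</p>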
<div class="code">
<div class="title">
<div style="clear: none">The call, unwrapped and simplified</div>
</div>
<div class="content"><span style="color: #008000">//From the BiGraphGenerate method in Segment.cpp&nbsp; </span><br />
aWord.GetElement(pCur-&gt;col, -1, pCur, &amp;pNextWords); <br />
<br />
<span style="color: #008000">//====================================================================== </span><br />
<br />
ELEMENT_TYPE CDynamicArray::GetElement(<span style="color: #0000ff">int</span> nRow, <span style="color: #0000ff">int</span> nCol, PARRAY_CHAIN pStart, PARRAY_CHAIN *pRet)&nbsp; <br />
<span style="color: #008000">// For this call the parameters take the values: nRow = pStart-&gt;col, nCol = -1, pStart, &amp;pNextWords </span><br />
<span style="color: #008000">// Note: the start node is named pStart here to keep it distinct from the pCur used in the code below. </span><br />
{&nbsp; <br />
&nbsp; ...... <br />
<br />
&nbsp; <span style="color: #0000ff">while</span> (pCur != NULL &amp;&amp; (<strong>(<span style="color: #0000ff">int</span>)pCur-&gt;row &lt; <font color="#ff0000">pStart-&gt;col</font></strong>))&nbsp; <br />
&nbsp; {&nbsp; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0)&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur;&nbsp; <br />
&nbsp;&nbsp;&nbsp; pCur = pCur-&gt;next;&nbsp; <br />
&nbsp; }&nbsp; <br />
<br />
&nbsp; <span style="color: #0000ff">if</span> (pCur != NULL &amp;&amp; (<strong>(<span style="color: #0000ff">int</span>)pCur-&gt;row == </strong><font color="#ff0000"><strong>pStart-&gt;col</strong></font>)&nbsp; <br />
&nbsp; <span style="color: #008000">//Find the same position&nbsp; </span><br />
&nbsp; {&nbsp; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Find it and return the value&nbsp; </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (pRet != 0)&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *pRet = pCur;&nbsp; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> pCur-&gt;<span style="color: #0000ff">value</span>;&nbsp; <br />
&nbsp; }&nbsp; <br />
&nbsp; <span style="color: #0000ff">return</span> INFINITE_VALUE;&nbsp; <br />
}</div>
</div>
<p>Now the meaning is fairly clear: the code simply looks for the first node with <code>pCur-&gt;row == pStart-&gt;col</code>.</p>
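<p>To see how the simplified lookup behaves, including when no matching row exists, here is a small self-contained sketch. <code>WordNode</code>, <code>kInfinite</code> and <code>getFirstInRow</code> are hypothetical names; <code>kInfinite</code> merely stands in for INFINITE_VALUE, whose actual definition is not shown in this article, and the part of GetElement elided above is not reproduced.</p>

```cpp
#include <cstddef>

// Hypothetical simplified node of a row-first sorted list.
struct WordNode {
    int row;
    int col;
    double value;
    WordNode* next;
};

// Stand-in for INFINITE_VALUE; the real constant is an assumption here.
const double kInfinite = 1e30;

// Row-first lookup with nCol == -1: scan forward from pStart, skipping every
// node whose row is smaller than nRow. While skipping, *pRet tracks the last
// node visited; on a hit it points at the first node with row == nRow.
double getFirstInRow(int nRow, WordNode* pStart, WordNode** pRet) {
    WordNode* pCur = pStart;
    while (pCur != nullptr && pCur->row < nRow) {
        if (pRet != nullptr) {
            *pRet = pCur;
        }
        pCur = pCur->next;
    }
    if (pCur != nullptr && pCur->row == nRow) {
        if (pRet != nullptr) {
            *pRet = pCur;
        }
        return pCur->value;
    }
    return kInfinite;
}
```

<p>One detail worth noting in this sketch: on a miss, <code>*pRet</code> is left pointing at the last node with a smaller row rather than at a match, so a caller has to check the row again before using it.</p>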
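<p>Some may ask why row and col are tied together like this. That is another rather involved question. For details, see sinboy's article "<a href="http://blog.csdn.net/sinboy/archive/2006/04/14/663123.aspx">ICTCLAS分词系统研究（四）--初次切分</a>". A brief explanation follows.</p>
<p>Figure 4 shows a linked list sorted row-first:</p>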
<p><img height="546" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220004.gif" width="684" border="0" /></p>
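<p>Figure 4: An example of the linked-list structure (TagArrayChain) after initial segmentation</p>
<p>Figure 5 shows the linked list from Figure 4 as a two-dimensional table:</p>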
<p><img height="229" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220005.gif" width="460" border="0" /></p>
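<p>Figure 5: The TagArrayChain example represented as a two-dimensional table</p>
<p>Next we need the smoothing value for each pair of adjacent words, for example &#8220;他@说&#8221;, &#8220;的@确&#8221;, &#8220;的@确实&#8221;, &#8220;的确@实&#8221; and &#8220;的确@实在&#8221;. Look closely and a pattern emerges. Take the word &#8220;的确&#8221;, whose col = 5: it has to be paired with two words, &#8220;实&#8221; and &#8220;实在&#8221;, and both of them have row = 5. Likewise, &#8220;确&#8221; has col = 5, so it too is paired with &#8220;实&#8221; and &#8220;实在&#8221; (row = 5).</p>
<p>This is exactly why the code analyzed above looks for the node with <code>pCur-&gt;row == pStart-&gt;col</code>. The resulting smoothing values can be laid out as in Figure 6:</p>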
<p><img height="473" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0220006.gif" width="667" border="0" /></p>
<p>Figure 6: The bigram graph (BiGraph) generated after initial segmentation, shown as a two-dimensional table</p>
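<p>The pairing rule (a word is followed by every word whose row equals its col) can be sketched in a few lines. This is an illustration rather than the ICTCLAS implementation: <code>Word</code> and <code>adjacentPairs</code> are hypothetical names, and a <code>std::vector</code> sorted row-first replaces the linked list. Only the coordinates of &#8220;的确&#8221;, &#8220;确&#8221;, &#8220;实&#8221; and &#8220;实在&#8221; are stated in the text; any other coordinates used with this sketch are assumptions.</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical word entry: the word spans positions [row, col) of the sentence.
struct Word {
    int row;
    int col;
    std::string text;
};

// Expects `words` sorted row-first. For each word, joins it with every word
// whose row equals its col, using '@' as the separator.
std::vector<std::string> adjacentPairs(const std::vector<Word>& words) {
    std::vector<std::string> pairs;
    for (std::size_t i = 0; i < words.size(); ++i) {
        // Skip words that start before words[i] ends (row < col).
        std::size_t j = i + 1;
        while (j < words.size() && words[j].row < words[i].col) {
            ++j;
        }
        // Every word starting exactly where words[i] ends is a successor.
        for (; j < words.size() && words[j].row == words[i].col; ++j) {
            pairs.push_back(words[i].text + "@" + words[j].text);
        }
    }
    return pairs;
}
```

<p>Given &#8220;的确&#8221; at (3, 5), &#8220;确&#8221; at (4, 5), &#8220;实&#8221; at (5, 6) and &#8220;实在&#8221; at (5, 7), the sketch yields &#8220;的确@实&#8221;, &#8220;的确@实在&#8221;, &#8220;确@实&#8221; and &#8220;确@实在&#8221;, in line with the observation above.</p>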
<p>Only now does the code author's real intent become clear:</p>
<div class="code">
<div class="title">
<div style="clear: none">Looking at the call again in context</div>
</div>
<div class="content"><span style="color: #008000">//========= From the BiGraphGenerate method in Segment.cpp =========== </span><br />
...... <br />
<span style="color: #008000">// Get the first node whose row equals the current node's col </span><br />
<strong>aWord.GetElement(pCur-&gt;col, -1, pCur, &amp;pNextWords); </strong><br />
<span style="color: #0000ff">while</span>(<strong>pNextWords&amp;&amp;pNextWords-&gt;row==pCur-&gt;col</strong>)<span style="color: #008000">//Next words </span><br />
{&nbsp; <br />
&nbsp; <span style="color: #008000">// Join the two adjacent words with the @ separator </span><br />
&nbsp; strcpy(sTwoWords,pCur-&gt;sWord); <br />
&nbsp; strcat(sTwoWords,WORD_SEGMENTER); <br />
&nbsp; strcat(sTwoWords,pNextWords-&gt;sWord); <br />
&nbsp; ...... <br />
} <br />
</div>
</div>
<ul>
    <li><strong><font color="#800080">Summary</font></strong> </li>
</ul>
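<p>Who would have thought that one short GetElement method has to handle all of the following: 1) a list sorted row-first; 2) a list sorted col-first; 3) the behavior when nRow is -1 (valid only when m_bRowFirst is false; the code never says so, which makes it very easy to get wrong!); 4) the behavior when nCol is -1 (valid only when m_bRowFirst is true, as shown above); 5) the behavior when neither nRow nor nCol is -1.</p>
<p>No wonder we end up with logical expressions like <code>while (pCur != NULL &amp;&amp; (nRow != -1 &amp;&amp; (int)pCur-&gt;row &lt; nRow || (nCol != -1 &amp;&amp; (int)pCur-&gt;row == nRow &amp;&amp; (int)pCur-&gt;col &lt; nCol))) </code>! One can only admire the author's capacity for intricate logic (they must have aced predicate logic in discrete mathematics) and for putting obstacles in the reader's way! Code like this is everywhere in ICTCLAS; I suppose I can only blame my own simple mind!</p>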
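<p>The two conclusions reached in this article could be written down as an explicit precondition instead of being left implicit. The sketch below is one hypothetical way to do that; <code>isMeaningfulQuery</code> is not part of ICTCLAS, and treating the case where both arguments are -1 as invalid is an assumption of this sketch, since the article does not discuss it.</p>

```cpp
// Returns true when the (nRow, nCol) combination is meaningful for a list
// with the given ordering, following the two conclusions in the text:
//   nRow == -1 is meaningful only for a col-first list (rowFirst == false);
//   nCol == -1 is meaningful only for a row-first list (rowFirst == true).
bool isMeaningfulQuery(bool rowFirst, int nRow, int nCol) {
    if (nRow == -1 && nCol == -1) {
        return false;  // assumption: nothing to search for
    }
    if (nRow == -1) {
        return !rowFirst;
    }
    if (nCol == -1) {
        return rowFirst;
    }
    return true;  // both coordinates given: valid for either ordering
}
```

<p>A check like this at the top of a lookup routine would make misuse of the -1 wildcard visible immediately rather than silently returning a not-found value.</p>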
<br />
<br />
Source: <a href="http://www.cnblogs.com/zhenyulu/category/85598.html">http://www.cnblogs.com/zhenyulu/category/85598.html</a>
 <br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 18:50</div>]]></description></item></channel></rss>