﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>语源科技BlogJava-刀剑笑</title><link>http://www.blogjava.net/jiangyz/</link><description>用技术改善你的生活</description><language>zh-cn</language><lastBuildDate>Mon, 13 Apr 2026 08:56:52 GMT</lastBuildDate><pubDate>Mon, 13 Apr 2026 08:56:52 GMT</pubDate><ttl>60</ttl><item><title>三种中文分词算法优劣比较 </title><link>http://www.blogjava.net/jiangyz/articles/238120.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Sat, 01 Nov 2008 12:16:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/articles/238120.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/238120.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/articles/238120.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/238120.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/238120.html</trackback:ping><description><![CDATA[<p><span style="font-size: 10pt">=============================================================================== </span></p>
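<p><span style="font-size: 10pt">You may repost this article if needed, but please credit the source and keep this notice intact. Thank you! </span></p>
<p><span style="font-size: 10pt">Parts of the content draw on material from the Internet; if you have any objection, please contact me. </span></p>
<p><span style="font-size: 10pt">Author: 刀剑笑 (Blog: http://blog.csdn.net/jyz3051) </span></p>
<p><span style="font-size: 10pt">Email: jyz3051 at yahoo dot com dot cn (replace 'at' with '@' and 'dot' with '.') </span></p>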
<p><span style="font-size: 10pt">=============================================================================== </span></p>
<p>&nbsp;</p>
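<p>Keywords: Chinese word segmentation, Chinese word segmentation algorithms, string-matching-based segmentation, understanding-based segmentation, statistics-based segmentation<span style="font-size: 10pt; color: #333333; font-family: 宋体"> </span></p>
<p><span style="font-size: 12pt; font-family: 宋体"><span style="color: black">Chinese word segmentation methods to date fall into three families: 1) string-matching-based segmentation; 2) understanding-based segmentation; 3) statistics-based segmentation. So far none of them has been shown to be more accurate than the others across the board; each has its strengths and its serious weaknesses. The table below gives a brief comparison:</span><span style="color: #333333"> </span></span></p>
<p style="text-align: center"><span style="font-size: 12pt; font-family: 宋体"><span style="color: black">Strengths and weaknesses of the segmentation methods</span><span style="color: #333333"> </span></span></p>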
<div>
<table style="border-collapse: collapse" border="0">
    <colgroup>
    <col style="width: 114px">
    <col style="width: 170px">
    <col style="width: 144px">
    <col style="width: 132px"></colgroup>
    <tbody valign="top">
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: 0.5pt solid; padding-left: 7px; border-left: 0.5pt solid; width: 127px; border-bottom: 0.5pt solid; height: 28px">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体"><strong>Method</strong></span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: 0.5pt solid; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体"><strong>String-matching-based</strong></span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: 0.5pt solid; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体"><strong>Understanding-based</strong></span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: 0.5pt solid; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体"><strong>Statistics-based</strong></span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">Ambiguity resolution</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Poor</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Strong</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Strong</span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">New-word recognition</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Poor</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Strong</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Strong</span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">Requires a dictionary</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Yes</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">No</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">No</span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">Requires a corpus</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">No</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">No</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Yes</span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">Requires a rule base</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">No</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Yes</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">No</span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">Algorithmic complexity</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Low</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Very high</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Moderate</span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">Technical maturity</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Mature</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Immature</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Mature</span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">Implementation difficulty</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Easy</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Very hard</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Moderate</span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">Segmentation accuracy</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Fair</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Accurate</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Fairly accurate</span></p>
            </td>
        </tr>
        <tr>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: 0.5pt solid; border-bottom: 0.5pt solid">
            <p><span style="font-size: 12pt; color: black; font-family: 宋体">Segmentation speed</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Fast</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Slow</span></p>
            </td>
            <td style="border-right: 0.5pt solid; padding-right: 7px; border-top: medium none; padding-left: 7px; border-left: medium none; border-bottom: 0.5pt solid">
            <p style="text-align: center"><span style="font-size: 12pt; color: black; font-family: 宋体">Moderate</span></p>
            </td>
        </tr>
    </tbody>
</table>
</div>
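<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(1) Ambiguity resolution </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Ambiguity arises when a string admits more than one segmentation and the computer has difficulty deciding which one is correct. For example, "表面的" can be segmented as "表面/的" or as "表/面的", and without further information the computer cannot tell which sequence is right. </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">String-matching-based segmentation: it only compares the text against an electronic dictionary, so it cannot resolve ambiguity; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Understanding-based segmentation: it works from the meaning of the string, so it resolves ambiguity very well; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Statistics-based segmentation: it derives the segmentation from how often characters occur together, so it usually picks the correct sequence, although it can still get some cases wrong. </span></p>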
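<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">A minimal sketch of why dictionary matching alone cannot settle the "表面的" case. It uses forward and backward maximum matching, two common string-matching strategies that this article does not name; the dictionary contents and the function names are hypothetical. When the two scans disagree, the dictionary offers no way to choose between them. </span></p>

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using Dict = std::set<std::u32string>;

// Forward maximum matching: at each position take the longest dictionary
// word that starts there, falling back to a single character.
std::vector<std::u32string> forwardMaxMatch(const std::u32string& text,
                                            const Dict& dict,
                                            std::size_t maxLen) {
    std::vector<std::u32string> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = std::min(maxLen, text.size() - pos);
        while (len > 1 && dict.count(text.substr(pos, len)) == 0) --len;
        words.push_back(text.substr(pos, len));
        pos += len;
    }
    return words;
}

// Backward maximum matching: the same idea, scanning from the end of the text.
std::vector<std::u32string> backwardMaxMatch(const std::u32string& text,
                                             const Dict& dict,
                                             std::size_t maxLen) {
    std::vector<std::u32string> words;
    std::size_t end = text.size();
    while (end > 0) {
        std::size_t len = std::min(maxLen, end);
        while (len > 1 && dict.count(text.substr(end - len, len)) == 0) --len;
        words.insert(words.begin(), text.substr(end - len, len));
        end -= len;
    }
    return words;
}
```

<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">With a toy dictionary containing "表面", "面的", "表", "面" and "的", the forward scan yields "表面/的" and the backward scan yields "表/面的". </span></p>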
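<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(2) New-word recognition </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">New-word recognition, also called out-of-vocabulary (OOV) word recognition (未登录词识别), means correctly identifying words that do not appear in the dictionary. OOV words come from two main sources. The first is personal names, organization names, addresses, titles and the like, which vary endlessly and can never be fully listed in a dictionary. The second is Internet buzzwords; "打酱油", for example, appeared online recently, spread quickly, and became a new word. Many studies have shown that new-word recognition is an important factor in the accuracy of Chinese word segmentation. </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">String-matching-based segmentation: it cannot recognize OOV words, because it only compares the text against words that exist in the dictionary; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Understanding-based segmentation: it works from the meaning of the string, so it recognizes new words very well; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Statistics-based segmentation: it is very effective for the second kind of OOV word, because a string is treated as a new word precisely when it occurs often. OOV words of the first kind follow certain patterns, such as a personal name being surname + given name (李胜利) or an organization being prefix + designation (希望集团), so recognizing them requires combining statistics with rules; statistics alone rarely get them right. </span></p>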
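<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(3) Requires a dictionary </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">String-matching-based segmentation: the basic idea is to compare the text against an electronic dictionary, so the dictionary is indispensable. The larger the dictionary, the higher the accuracy, because a larger dictionary leaves fewer OOV words and so greatly reduces the errors they cause; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Understanding-based segmentation: it works from the meaning of the string, so it does not need an electronic dictionary; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Statistics-based segmentation: the result is obtained from statistics alone, so an electronic dictionary is not required. </span></p>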
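<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(4) Requires a corpus </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">String-matching-based segmentation: segmentation only compares the text against an existing electronic dictionary, so no corpus is needed; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Understanding-based segmentation: it works from the meaning of the string, so no corpus is needed; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Statistics-based segmentation: it must be trained on a corpus, so a corpus is indispensable, and a good corpus is what guarantees segmentation accuracy. </span></p>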
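<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(5) Requires a rule base </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">String-matching-based segmentation: segmentation only compares the text against an existing electronic dictionary and needs no rule base; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Understanding-based segmentation: rules are the basis on which the computer understands text, so an accurate and complete rule base is a prerequisite for this approach; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Statistics-based segmentation: it is trained statistically on a corpus, so a rule base is not required. </span></p>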
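<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(6) Algorithmic complexity </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">String-matching-based segmentation: it only performs string comparisons, so the algorithm is simple; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Understanding-based segmentation: it has to process a large set of rules thoroughly, so the algorithm is very complex; in fact no mature algorithm of this kind exists yet; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Statistics-based segmentation: it requires training on a corpus. The algorithms are fairly complex but already common, so this approach is more complex than the first and simpler than the second. Most practical segmentation systems today use this kind of algorithm. </span></p>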
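<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(7) Technical maturity </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">String-matching-based segmentation: the earliest and most mature family of algorithms; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Understanding-based segmentation: the least mature family; no mature algorithm exists so far; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Statistics-based segmentation: several mature algorithms exist and they largely meet the needs of real applications. </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">So, in terms of technical maturity: matching-based &gt; statistics-based &gt; understanding-based. </span></p>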
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(8) Implementation difficulty </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">By the same reasoning, in terms of implementation difficulty: understanding-based &gt; statistics-based &gt; matching-based. </span></p>
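<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(9) Segmentation accuracy </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">There is no firm conclusion yet. In theory, understanding-based segmentation is the most accurate and could reach 100% accuracy. Matching-based and statistics-based segmentation are "shallow understanding" methods that do not involve real comprehension of meaning, so they can make mistakes and are unlikely to reach 100% accuracy. </span></p>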
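<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">(10) Segmentation speed </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Matching-based segmentation: the algorithm is simple and easy to run, so it is fast; for that reason it often serves as a preprocessing step for the other two approaches, producing a rough segmentation of the string; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Understanding-based segmentation: it usually has to work through a huge rule base, so it is the slowest; </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">Statistics-based segmentation: it only compares the text against precomputed statistics, so its speed is moderate. </span></p>
<p><span style="font-size: 12pt; color: #333333; font-family: 宋体">So, from fastest to slowest, the usual order is: matching-based &gt; statistics-based &gt; understanding-based. </span></p>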
<p>&nbsp;</p>
<img src ="http://www.blogjava.net/jiangyz/aggbug/238120.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2008-11-01 20:16 <a href="http://www.blogjava.net/jiangyz/articles/238120.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>A Study of the ICTCLAS Word Segmentation System (Part 1) (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171360.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 15:58:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171360.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171360.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171360.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171360.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171360.html</trackback:ping><description><![CDATA[&nbsp;ICTCLAS is a widely praised word segmentation system developed by Zhang Huaping (张华平) and Liu Qun (刘群) of the Institute of Computing Technology, Chinese Academy of Sciences. Commendably, its Free edition is open source, which gives many of us beginners valuable study material.
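<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; One shortcoming is that the source code comes without accompanying documentation, so reading it can be difficult, especially for anyone unfamiliar with C/C++. I use Java and VB as my main development languages. I did learn C/C++ at university, but I have not used it since I started working and have forgotten almost all of the syntax. Still, the fundamentals of programming languages carry over, and Java took much of its syntax from C/C++, so the two are similar in places. After one pass through the source code, the main syntax should no longer be a problem.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;Although ICTCLAS ships without complete documentation, the main ideas can be pieced together from the papers published by Zhang Huaping and Liu Qun.</p>
<p>&nbsp;&nbsp; The core idea of the system is to segment with a CHMM (cascaded hidden Markov model, 层叠隐马尔可夫模型). The layering both improves segmentation accuracy and keeps segmentation efficient. There are five layers, as shown in Figure 1 below:</p>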
<p><img alt="" src="http://p.blog.csdn.net/images/p_blog_csdn_net/sinboy/CHMM结构图.bmp" /></p>
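<p>Basic approach: first perform atom segmentation; then run N-shortest-path rough segmentation on top of it to find the N best candidate segmentations; build the bigram word table; generate the segmentation result; and finally perform part-of-speech (POS) tagging, which completes the main segmentation steps.</p>
<p>The following examines the main parts of the source code:</p>
<p>1. The ICTCLAS program starts by calling CICTCLAS_WinDlg::OnBtnRun(). The code shows that it processes the source string paragraph by paragraph. The dictionaries are loaded before segmentation: the constructor that runs when the m_ICTCLAS object is created loads them. For an analysis of the dictionary structure, see Part 2 of this series.</p>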
<p>void CICTCLAS_WinDlg::OnBtnRun() <br />
{</p>
<p>&nbsp;&nbsp; ......</p>
<p><font color="#3366ff">&nbsp;</font><font color="#0000ff">//Segmentation and POS tagging happen here</font></p>
<p>&nbsp; if(!<font color="#ff0000">m_ICTCLAS.ParagraphProcessing</font>((char *)(LPCTSTR)m_sSource,sResult))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;m_sResult.Format("错误：程序初始化异常！");<br />
&nbsp;&nbsp;&nbsp;else<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;m_sResult.Format("%s",sResult);<font color="#0000ff">//Output the final segmentation result</font></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;......</p>
<p>}</p>
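<p>2. Inside OnBtnRun(), the paragraph-level method bool CResult::ParagraphProcessing(char *sParagraph,char *sResult) carries out the entire segmentation process, including POS tagging. The first parameter is the source string and the second receives the segmented string. These two methods together cover the whole segmentation process. What remains is to see how ParagraphProcessing calls other methods, step by step, to follow the framework shown in the figure above. For simplicity, we leave out the analysis of out-of-vocabulary words for now.</p>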
<p><font color="#0000ff">//Paragraph Segment and POS Tagging</font><br />
bool CResult::ParagraphProcessing(char *sParagraph,char *sResult)<br />
{</p>
<p>&nbsp;&nbsp; ........</p>
<p>&nbsp;&nbsp; <font color="#ff0000">Processing</font>(sSentence,1); <font color="#0000ff">//Processing and output the result of current sentence.<br />
</font>&nbsp;&nbsp;Output(m_pResult[0],sSentenceResult,bFirstIgnore);<font color="#0000ff"> //Output to the imediate result</font></p>
<p>&nbsp; .......</p>
<p>}</p>
<p>3. The main segmentation work happens in Processing(), which we analyze further below.</p>
<p>bool CResult::Processing(char *sSentence,unsigned int nCount)<br />
{</p>
<p>......</p>
<p><font color="#3366ff">&nbsp;</font><font color="#0000ff">//Perform bigram segmentation</font></p>
<p>m_Seg.BiSegment(sSentence, m_dSmoothingPara,m_dictCore,m_dictBigram,nCount);</p>
<p>......</p>
<p><font color="#3366ff">&nbsp;</font><font color="#0000ff">//POS tagging happens here</font></p>
<p>m_POSTagger.POSTagging(m_Seg.m_pWordSeg[nIndex],m_dictCore,m_dictCore);</p>
<p>......</p>
<p>}</p>
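<p>4. For now we set POS tagging aside and focus on bigram segmentation, since it is the first of the two key steps in segmentation.</p>
<p>References:</p>
<p>1. &lt;&lt;基于层叠隐马模型的汉语词法分析&gt;&gt; (Chinese lexical analysis based on cascaded hidden Markov models), Liu Qun, Zhang Huaping, et al.</p>
<p>2. &lt;&lt;基于N-最短路径的中文词语粗分模型&gt;&gt; (A rough Chinese word segmentation model based on N-shortest paths), Zhang Huaping, Liu Qun</p>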
<br />
<br />
<p id="TBPingURL">Source: <a href="http://blog.csdn.net/sinboy/archive/2006/03/12/622596.aspx">http://blog.csdn.net/sinboy/archive/2006/03/12/622596.aspx</a></p>
<img src ="http://www.blogjava.net/jiangyz/aggbug/171360.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 23:58 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171360.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>ICTCLAS: the CAS Word Segmentation System, Code Notes, Chinese Word Segmentation, POS Tagging (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171335.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 14:10:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171335.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171335.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171335.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171335.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171335.html</trackback:ping><description><![CDATA[<p>Overview of the CAS (Chinese Academy of Sciences) Word Segmentation System</p>
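<p>Over the past few days I finished reading the code of the CAS segmentation program. Here is an overview, with explanations of some of the key data structures.</p>
<p><br />
0. Overall flow</p>
<p>Consider the input sentence sSentence="张华平欢迎您"</p>
<p>Overall flow:</p>
<p>1. Segmentation: "张/华/平/欢迎/您"</p>
<p>2. POS tagging: "张/q 华/j 平/j 欢迎/v 您/r"</p>
<p>3. Named-entity (NE) recognition, covering personal names, transliterated names and place names: "张/q 华/j 平/j 欢迎/v 您/r" "张华平/nr"</p>
<p>4. Re-segmentation: "张华平/欢迎/您"</p>
<p>5. Repeated POS tagging: "张华平/nr 欢迎/v 您/r"</p>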
<p>&nbsp;</p>
<p><br />
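Technical details</p>
<p>1. Segmentation</p>
<p>The segmenter first adds a begin marker and an end marker to the two ends of the sentence:<br />
sSentence="始##始张华平欢迎您末##末"</p>
<p>Segmentation follows. The basic idea is to choose the segmentation whose words have the highest joint probability.</p>
<p>Suppose "张华平欢迎您" is segmented as "w_1/w_2/.../w_k". Then, treating the words as independent in the last step,<br />
w_1/w_2/.../w_k=argmax_{w_1'/w_2'/.../w_k'}P(w_1',w_2',...,w_k')=argmax_{w_1'/w_2'/.../w_k'}P(w_1')P(w_2')...P(w_k')</p>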
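<p>Details:</p>
<p>First the sentence is split into atoms: each Chinese character forms its own atom, while a run of consecutive letters or digits forms a single atom. For example, "始##始张华平2006欢迎您asdf末##末" is split into "始##始/张/华/平/2006/欢/迎/您/asdf/末##末"</p>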
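<p>A minimal sketch of the atom-splitting rule just described, not the ICTCLAS implementation. The function names are hypothetical, the input is taken as UTF-32 for simplicity, letters and digits are assumed to share one character class, and the begin/end markers are left out.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// True for ASCII letters and digits, which are grouped into runs.
bool isAsciiAlnum(char32_t c) {
    return (c >= U'0' && c <= U'9') ||
           (c >= U'a' && c <= U'z') ||
           (c >= U'A' && c <= U'Z');
}

// Split text into atoms: one atom per character, except that a run of
// consecutive ASCII letters/digits becomes a single atom.
std::vector<std::u32string> atomSegment(const std::u32string& text) {
    std::vector<std::u32string> atoms;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t j = i + 1;
        if (isAsciiAlnum(text[i])) {
            while (j < text.size() && isAsciiAlnum(text[j])) ++j;
        }
        atoms.push_back(text.substr(i, j - i));
        i = j;
    }
    return atoms;
}
```

<p>Calling atomSegment(U"张华平2006欢迎您asdf") gives eight atoms, with "2006" and "asdf" each kept whole.</p>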
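<p>Next, every word that can occur in the sentence is found. For "始##始张华平欢迎您末##末" the words are "始##始","张","华","平","欢","迎","您","末##末","欢迎"<br />
All possible parts of speech of these words, and the frequencies of the words, are looked up as well.</p>
<p>These words are stored in a structure, implemented as follows:</p>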
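<p>m_segGraph holds a (PARRAY_CHAIN)m_pHead, which is a linked chain</p>
<p>(PARRAY_CHAIN)p-&gt;row//start position of the word<br />
(PARRAY_CHAIN)p-&gt;col//end position of the word<br />
(PARRAY_CHAIN)p-&gt;value//-log(probability of the word); the word's frequency is the sum of its frequencies over all of its parts of speech<br />
(PARRAY_CHAIN)p-&gt;nPos//part of speech of the word; e.g. a personal name is tagged 'nr', so nPos='n'*256+'r'; if the word has several parts of speech, nPos=0<br />
(PARRAY_CHAIN)p-&gt;sWord//the word itself<br />
(PARRAY_CHAIN)p-&gt;nWordLen//length of the word</p>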
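<p>The nPos packing can be sketched as below. The helper names packPos and unpackPos are hypothetical, and storing a one-letter tag in the high byte with a zero low byte is an assumption made here for illustration; the text above only states the two-letter case.</p>

```cpp
#include <cassert>
#include <string>

// Pack a one- or two-letter POS tag into an int, e.g. "nr" -> 'n'*256 + 'r'.
// Assumption: a one-letter tag occupies the high byte, low byte is zero.
int packPos(const std::string& tag) {
    assert(!tag.empty() && tag.size() <= 2);
    int code = static_cast<unsigned char>(tag[0]) * 256;
    if (tag.size() == 2) code += static_cast<unsigned char>(tag[1]);
    return code;
}

// Inverse of packPos: recover the tag letters from the packed value.
std::string unpackPos(int code) {
    std::string tag(1, static_cast<char>(code / 256));
    if (code % 256 != 0) tag.push_back(static_cast<char>(code % 256));
    return tag;
}
```

<p>For example, packPos("nr") is 110*256+114 = 28274, and unpackPos(28274) returns "nr".</p>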
<p>For example:<br />
"0 始##始 1 张 2 华 3 平 4 欢 5 迎 6 您 7 末##末 8"</p>
<p>For "张":<br />
row=1<br />
col=2<br />
value=-log[(frequency of "张"+1)/(MAX_FREQUENCE)]<br />
nPos=0//"张" has 5 parts of speech<br />
sWord="张"<br />
nWordLen=2</p>
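<p>The value formula can be sketched as a small helper. The name edgeWeight is hypothetical, and the normalizing constant is passed in as a parameter because its actual value (MAX_FREQUENCE) is not given in this article. The +1 keeps the logarithm finite for words with frequency 0.</p>

```cpp
#include <cassert>
#include <cmath>

// Cost of a word edge: -log((frequency + 1) / maxFrequency).
// Rarer words get a larger cost, so a cheaper path is a more probable one.
double edgeWeight(int frequency, double maxFrequency) {
    assert(frequency >= 0 && maxFrequency > frequency);
    return -std::log((frequency + 1) / maxFrequency);
}
```

<p>Because the costs are negative logarithms, adding them along a path corresponds to multiplying the word probabilities, which is what lets a shortest-path search find the most probable segmentation.</p>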
<p>Entries are stored sorted by col ascending, then by row ascending</p>
<p>m_segGraph.m_pHead&nbsp;"始##始"<br />
&nbsp;&nbsp;&nbsp;"张"<br />
&nbsp;&nbsp;&nbsp;"华"<br />
&nbsp;&nbsp;&nbsp;"平"<br />
&nbsp;&nbsp;&nbsp;"欢"<br />
&nbsp;&nbsp;&nbsp;"欢迎"<br />
&nbsp;&nbsp;&nbsp;"迎"<br />
&nbsp;&nbsp;&nbsp;"您"<br />
&nbsp;&nbsp;&nbsp;"末##末"<br />
&nbsp;&nbsp;&nbsp;<br />
m_segGraph.m_nRow=7<br />
m_segGraph.m_nCol=8</p>
<p><br />
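Next, a graph covering all possible word combinations is built, and the m_nValueKind most probable results are kept, ranked by probability.</p>
<p>Details:</p>
<p>Initialization:<br />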
(CNShortPath)sp.m_apCost=m_segGraph;<br />
(CNShortPath)sp.m_nVertex=m_segGraph.m_nCol+1<br />
(CNShortPath)sp.m_pParent=CQueue[m_segGraph.m_nCol][m_nValueKind]<br />
(CNShortPath)sp.m_pWeight=ELEMENT_TYPE[m_segGraph.m_nCol][m_nValueKind]//m_pWeight[0][0] is the weight at node 1</p>
<p>Inside sp.ShortPath() (abridged):<br />
for(nCurNode=1;nCurNode&lt;sp.m_nVertex;nCurNode++)<br />
{<br />
&nbsp;CQueue queWork;//temporary CQueue<br />
&nbsp;eWeight=m_apCost-&gt;GetElement(-1,nCurNode,0,&amp;pEdgeList);//fetch the value of the first PARRAY_CHAIN with col=nCurNode; e.g. if nCurNode=6, pEdgeList points to "欢迎" and eWeight=pEdgeList-&gt;value<br />
&nbsp;while(pEdgeList&amp;&amp;pEdgeList-&gt;col==nCurNode)//for each pEdgeList with col=nCurNode<br />
&nbsp;{<br />
&nbsp;<br />
&nbsp;&nbsp;for(i=0;i&lt;m_nValueKind;i++)<br />
&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;queWork.Push(pEdgeList-&gt;row,0,eWeight+m_pWeight[pEdgeList-&gt;row-1][i]);<br />
&nbsp;&nbsp;&nbsp;//push every pEdgeList with col=nCurNode into queWork in ascending order of weight<br />
&nbsp;&nbsp;}<br />
&nbsp;}//for example<br />
&nbsp;/*<br />
&nbsp;&nbsp;"欢迎"&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp; m_pWeight[3][0]=0.2&nbsp;&nbsp;&nbsp;&nbsp; eWight=0.2&nbsp;&nbsp;&nbsp;&nbsp;=&gt;queWork.Push(4,0,0.4);<br />
&nbsp;&nbsp;"0 始##始 1 张 2 华 3 平 &nbsp;&nbsp;4 &nbsp;欢 &nbsp;&nbsp;&nbsp;5 &nbsp;&nbsp;迎 6 您 7 末##末 8"<br />
&nbsp;&nbsp;"欢"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWeight[4][0]=0.5&nbsp;eWight=0.1&nbsp;&nbsp;=&gt;queWork.Push(5,0,0.6);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWeight[4][1]=0.6&nbsp;eWight=0.1&nbsp;&nbsp;=&gt;queWork.Push(5,0,0.7);<br />
&nbsp;<br />
&nbsp;queWork&nbsp;&nbsp;"欢迎"&nbsp;&nbsp;0.4<br />
&nbsp;&nbsp;&nbsp;"迎"&nbsp;&nbsp;0.6<br />
&nbsp;&nbsp;&nbsp;"迎"&nbsp;&nbsp;0.7<br />
&nbsp;<br />
&nbsp;*/<br />
&nbsp;for(i=0;i&lt;m_nValueKind;i++)m_pWeight[nCurNode-1][i]=INFINITE_VALUE;//initialize the current m_pWeight[nCurNode-1]<br />
&nbsp;while(i&lt;m_nValueKind&amp;&amp;queWork.Pop(&amp;nPreNode,&amp;nIndex,&amp;eWeight)!=-1)//pop each pEdgeList's row from queWork in order; nIndex ranges from 0 to m_nValueKind-1; eWeight=pEdgeList-&gt;value<br />
&nbsp;{<br />
&nbsp;&nbsp;m_pWeight[nCurNode-1][i]=eWeight;//keep the first m_nValueKind results<br />
&nbsp;&nbsp;m_pParent[nCurNode-1][i].Push(nPreNode,nIndex);//pushed into m_pParent in ascending order of pEdgeList-&gt;value, i.e. descending order of P<br />
&nbsp;}<br />
}</p>
<p>Once m_pParent is available, paths are generated in ascending order of m_pWeight[m_segGraph.m_nCol-1]<br />
CNShortPath::GetPaths(unsigned int nNode,unsigned int nIndex,int **nResult,bool bBest)<br />
//nNode=m_segGraph.m_nCol; nIndex runs from 0 to m_nValueKind-1; nResult receives the output; bBest=true outputs only the best result<br />
For example, the results for "始##始张华平欢迎您末##末" are<br />
nResult[0]={0,1,2,3,4,6,7,8,-1}&nbsp;&nbsp;"始##始/张/华/平/欢迎/您/末##末"<br />
nResult[1]={0,1,2,3,4,5,6,7,8,-1}&nbsp;"始##始/张/华/平/欢/迎/您/末##末"<br />
There is no third result</p>
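<p>The search above can be approximated by a small k-best dynamic program over the word graph. This is a simplified sketch under stated assumptions, not the CNShortPath code: the type and function names are hypothetical, it keeps the k cheapest individual paths per node (the original groups paths by distinct weight value), and it assumes every edge satisfies row &lt; col.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Edge { int row; int col; double weight; };  // a word spanning row -> col

// One way of reaching a node: total cost plus a back-pointer.
struct Candidate { double weight; int prevNode; int prevRank; };

// For every node keep the k cheapest ways of reaching it from node 0.
std::vector<std::vector<Candidate>> kBestTable(const std::vector<Edge>& edges,
                                               int nodeCount, int k) {
    std::vector<std::vector<Candidate>> best(nodeCount);
    best[0].push_back({0.0, -1, -1});
    for (int node = 1; node < nodeCount; ++node) {
        std::vector<Candidate> work;  // plays the role of queWork
        for (const Edge& e : edges) {
            if (e.col != node) continue;
            assert(e.row < e.col);
            const int known = static_cast<int>(best[e.row].size());
            for (int rank = 0; rank < known; ++rank) {
                work.push_back({best[e.row][rank].weight + e.weight, e.row, rank});
            }
        }
        std::stable_sort(work.begin(), work.end(),
                         [](const Candidate& a, const Candidate& b) {
                             return a.weight < b.weight;
                         });
        if (static_cast<int>(work.size()) > k) work.resize(k);
        best[node] = work;
    }
    return best;
}

// Follow the back-pointers from (node, rank) to node 0 and return the path.
std::vector<int> tracePath(const std::vector<std::vector<Candidate>>& best,
                           int node, int rank) {
    std::vector<int> path;
    while (node >= 0) {
        path.push_back(node);
        const Candidate& c = best[node][rank];
        node = c.prevNode;
        rank = c.prevRank;
    }
    std::reverse(path.begin(), path.end());
    return path;
}
```

<p>With the nine edges of the example graph, made-up weights of 1.0 per edge and 1.5 for "欢迎", and k=2, tracePath for the last node returns 0,1,2,3,4,6,7,8 at rank 0 and 0,1,2,3,4,5,6,7,8 at rank 1, matching nResult[0] and nResult[1] without the -1 terminator.</p>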
<p>All nResult[i] are taken as segmentation results and stored in m_graphOptimum, which has the same structure as m_segGraph but holds only the words that appear in nResult[i]:</p>
<p>If m_nValueKind=1:<br />
m_graphOptimum.m_pHead&nbsp;"始##始"<br />
&nbsp;&nbsp;&nbsp;"张"<br />
&nbsp;&nbsp;&nbsp;"华"<br />
&nbsp;&nbsp;&nbsp;"平"<br />
&nbsp;&nbsp;&nbsp;"欢迎"<br />
&nbsp;&nbsp;&nbsp;"您"<br />
&nbsp;&nbsp;&nbsp;"末##末"<br />
&nbsp;&nbsp;&nbsp;<br />
m_graphOptimum.m_nRow=7<br />
m_graphOptimum.m_nCol=8</p>
<p><br />
If m_nValueKind=2:</p>
<p>m_graphOptimum.m_pHead&nbsp;"始##始"<br />
&nbsp;&nbsp;&nbsp;"张"<br />
&nbsp;&nbsp;&nbsp;"华"<br />
&nbsp;&nbsp;&nbsp;"平"<br />
&nbsp;&nbsp;&nbsp;"欢"<br />
&nbsp;&nbsp;&nbsp;"欢迎"<br />
&nbsp;&nbsp;&nbsp;"迎"<br />
&nbsp;&nbsp;&nbsp;"您"<br />
&nbsp;&nbsp;&nbsp;"末##末"<br />
&nbsp;&nbsp;&nbsp;<br />
m_graphOptimum.m_nRow=7<br />
m_graphOptimum.m_nCol=8</p>
<p><br />
See&nbsp;bool CSegment::GenerateWord(int **nSegRoute, int nIndex); here nSegRoute is the nResult above and is an input parameter, and nIndex selects the nIndex-th segmentation result.</p>
<p><br />
In addition, CResult.m_Seg.m_pWordSeg[nIndex][k] stores the information for the k-th word of the nIndex-th result:</p>
<p>CResult.m_Seg.m_pWordSeg[nIndex][k].sWord//the word<br />
CResult.m_Seg.m_pWordSeg[nIndex][k].nHandle//the POS<br />
CResult.m_Seg.m_pWordSeg[nIndex][k].dValue//-logP</p>
<p>This completes the segmentation step.</p>
<p>II. POS tagging</p>
<p><br />
m_POSTagger.POSTagging(m_Seg.m_pWordSeg[nIndex],m_dictCore,m_dictCore);//tag the nIndex-th segmentation result with the standard dictionary<br />
For simplicity, m_nValueKind=1 is assumed below.</p>
<p><br />
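m_POSTagger tags the segmented words with an HMM. The emission probability is P(w_i|c_i), where c_i is the POS tag and w_i is the word; the transition probability is P(c_i|c_{i-1}); the initial state is P(c_0), i.e. the probability of the tag of "始##始".<br />
The Viterbi algorithm finds the tag sequence c_1/c_2/.../c_k=argmax_{c_1'/c_2'/.../c_k'}P(w_1,...,w_k,c_1',...,c_k')=argmax_{c_1'/c_2'/.../c_k'}prod_{i=1..k}P(w_i|c_i')P(c_i'|c_{i-1}')</p>
<p>The sentence is split into spans, each ending at a word w that has a unique POS, that is, a word for which CResult.m_Seg.m_pWordSeg[0][k].nHandle&gt;0 after segmentation.</p>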
<p>For example:<br />
"0 始##始 1 张&nbsp; 2&nbsp;&nbsp; 华&nbsp;&nbsp; 3&nbsp;&nbsp; 平&nbsp;&nbsp; 4&nbsp;&nbsp; 欢迎&nbsp;&nbsp; 5&nbsp;&nbsp; 您&nbsp;&nbsp; 6 末##末 7"<br />
&nbsp;&nbsp;&nbsp; pos1&nbsp;&nbsp; pos1&nbsp;&nbsp; pos1&nbsp;&nbsp;&nbsp;&nbsp; pos1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pos1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pos1&nbsp;&nbsp;&nbsp;&nbsp; pos1<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pos2&nbsp;&nbsp; pos2&nbsp;&nbsp;&nbsp;&nbsp; pos2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pos2<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pos3&nbsp;&nbsp; pos3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pos3<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pos4<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pos5</p>
<p>The sentence is then split into the following spans:<br />
"0 始##始"<br />
"1 张&nbsp; 2&nbsp;&nbsp; 华&nbsp;&nbsp; 3&nbsp;&nbsp; 平 4&nbsp; 欢迎&nbsp;&nbsp; 5&nbsp;&nbsp; 您"<br />
"6 末##末"<br />
The Viterbi algorithm is run on each span to settle on a single POS tag for every word in it.</p>
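<p>The splitting rule can be sketched in a few lines. The function spanEnds and the tag counts in the usage note are hypothetical; the sketch only assumes that a span closes at the first word with exactly one candidate tag, and that the end of the sentence also closes a span.</p>

```cpp
#include <cassert>
#include <vector>

// tagCounts[i] is the number of candidate POS tags of word i.
// Returns the exclusive end index of every span: a span closes at the first
// word that has exactly one candidate tag, or at the end of the sentence.
std::vector<int> spanEnds(const std::vector<int>& tagCounts) {
    std::vector<int> ends;
    const int n = (int)tagCounts.size();
    for (int i = 0; i < n; ++i) {
        assert(tagCounts[i] >= 1);  // assumption: every word has at least one tag
        const bool lastWord = (i + 1 == n);
        if (tagCounts[i] == 1 || lastWord) ends.push_back(i + 1);
    }
    return ends;
}
```

<p>For seven words whose candidate counts are assumed to be {1, 5, 3, 2, 2, 1, 1}, the span ends are 1, 6 and 7, which are the same values the walkthrough reports for the three GetFrom calls.</p>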
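<p>Details:</p>
<p>The frequency of word w with tag c, from which the emission probability P(w|c) is computed, is stored in a dict such as dictCore or dictUnknown; dict.GetFrequency(char *sWord, int nHandle) returns the frequency of sWord with POS nHandle.<br />
The probability P(c) is stored in a context such as m_context; context.GetFrequency(int nKey, int nSymbol) returns the frequency of POS nSymbol, with nKey=0.<br />
The transition probability P(c|c') is stored in a context such as m_context; context.GetContextPossibility(int nKey, int nPrev, int nCur) returns the transition probability for c'=nPrev and c=nCur, with nKey=0.</p>
<p>Key data structures</p>
<p>m_nTags[i][k] is the k-th POS of the i-th w.<br />
m_dFrequency[i][k] has two meanings:<br />
in GetFrom it holds -log(emission probability of the k-th POS of the i-th w);<br />
in CSpan::Disamb() it holds -log(maximum joint probability of a path from the 0-th w to the k-th POS of the i-th w). For example:</p>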
<p><br />
&nbsp;w_j&nbsp;&nbsp;&nbsp;w_{j+1}<br />
m_dFrequency[j][0]--&nbsp;m_dFrequency[j+1][0]<br />
m_dFrequency[j][1]&nbsp; --&nbsp;m_dFrequency[j+1][1]<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; --m_dFrequency[j+1][2]</p>
<p>&nbsp;</p>
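<p>The weight of a path in the diagram is W([j,0]-&gt;[j+1,2])=m_dFrequency[j][0]-log(m_context.GetContextPossibility(0,m_nTags[j][0],m_nTags[j+1][2]))<br />
The cheapest predecessor is then chosen, and its weight is added to the emission cost that GetFrom already stored in m_dFrequency[j+1][2]:<br />
m_dFrequency[j+1][2]+=min{W([j,0]-&gt;[j+1,2]),W([j,1]-&gt;[j+1,2])}</p>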
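<p>The recurrence can be written out as a compact Viterbi pass in -log space. This is a sketch under assumptions rather than the code of CSpan::Disamb(): the function viterbi and its table layout are hypothetical, the array cost plays the role of m_dFrequency, and the transition costs are passed in as a table instead of being looked up through m_context.</p>

```cpp
#include <cassert>
#include <vector>

// emit[i][k]     : -log of the emission probability of word i under its k-th tag.
// trans[i][p][k] : -log of the transition probability from the p-th tag of
//                  word i-1 to the k-th tag of word i (trans[0] is unused).
// Returns, for each word, the index of the tag on the cheapest path.
// Assumption: every word has at least one candidate tag.
std::vector<int> viterbi(const std::vector<std::vector<double>>& emit,
                         const std::vector<std::vector<std::vector<double>>>& trans) {
    const int n = (int)emit.size();
    assert(n > 0 && (int)trans.size() == n);
    std::vector<std::vector<double>> cost(emit);  // starts as the emission costs
    std::vector<std::vector<int>> prev(n);
    for (int i = 0; i < n; ++i) prev[i].assign(emit[i].size(), -1);

    for (int i = 1; i < n; ++i) {
        for (int k = 0; k < (int)emit[i].size(); ++k) {
            // Pick the predecessor tag with the lowest accumulated cost.
            int bestPrev = 0;
            double bestCost = cost[i - 1][0] + trans[i][0][k];
            for (int p = 1; p < (int)emit[i - 1].size(); ++p) {
                const double c = cost[i - 1][p] + trans[i][p][k];
                if (c < bestCost) { bestCost = c; bestPrev = p; }
            }
            cost[i][k] += bestCost;  // emission cost plus cheapest way to get here
            prev[i][k] = bestPrev;
        }
    }

    // Start from the cheapest final tag and walk the back-pointers.
    int best = 0;
    for (int k = 1; k < (int)cost[n - 1].size(); ++k)
        if (cost[n - 1][k] < cost[n - 1][best]) best = k;
    std::vector<int> tags(n);
    for (int i = n - 1; i >= 0; --i) {
        tags[i] = best;
        best = prev[i][best];
    }
    return tags;
}
```

<p>Working with sums of -log values instead of products of probabilities avoids floating-point underflow on long spans, which is presumably why the weights are stored in that form.</p>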
<p><br />
m_nCurLength is the number of w in the current span plus 1.</p>
<p>In m_POSTagger.POSTagging, continuing the example above:<br />
while(i&gt;-1&amp;&amp;pWordItems[i].sWord[0]!=0)//runs once per span, i.e. 3 times in the example above<br />
{<br />
&nbsp;i=GetFrom(pWordItems,nStartPos,dictCore,dictUnknown);//i=GetFrom(pWordItems,0,dictCore,dictUnknown)=1<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;//i=GetFrom(pWordItems,1,dictCore,dictUnknown)=6<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;//i=GetFrom(pWordItems,6,dictCore,dictUnknown)=7<br />
&nbsp;//takes words forward from nStartPos until a w with a unique POS is reached; the POS candidates of each w are recorded in m_nTags and the emission cost -log P(w|c) in m_dFrequency<br />
&nbsp;GetBestPOS();//calls Disamb(), which uses the Viterbi algorithm to find the best tagging of the span (the one with the highest joint probability); the best path is stored in m_nBestTag<br />
&nbsp;//the nHandle of each pWordItems entry is then assigned from m_nBestTag<br />
}</p>
<p><br />
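III. NE recognition: person names, transliterated names, and place names</p>
<p>The basic idea is the same as in POS tagging, except that the POS c is replaced by a role r. Taking person-name recognition as the example, the first step is to identify the tags (that is, the POS) that make up a person name; see<br />
"Chinese Named Entity Recognition Using Role Model"<br />
In the function CUnknowWord::Recognition(PWORD_RESULT pWordSegResult, CDynamicArray &amp;graphOptimum,CSegGraph &amp;graphSeg,CDictionary &amp;dictCore),<br />
once each split span has been recognized, m_roleTag.POSTagging(pWordSegResult,dictCore,m_dict); tags the result of the first segmentation pass.<br />
First, dictUnknown.GetHandle(m_sWords[i],&amp;nCount,aPOS,aFreq); obtains the roles of m_sWords[i] in the NE dictionary.<br />
Next, dictCore.GetHandle(m_sWords[i],&amp;nCount,aPOS,aFreq); obtains the tags of m_sWords[i] in the standard dictionary. Whenever m_sWords[i] has a tag in the standard dictionary, the tag is set to 0, and the emission probability under that tag is P(w|c)=P(sum_{aFreq}|c=0)<br />
Then SplitPersonPOS(dictUnknown) splits every w tagged LH or TR into two.<br />
For example, in "张/SS 华/GH 平欢/TR 迎/RC 您/RC" the token "平欢" is split into "平/GT" and "欢/12".<br />
After that, PersonRecognize(dictUnknown); matches a set of templates; "SS/GH/TR" matches "张华平". The matched fragments are stored in m_nUnknownWords, and their nHandle is set to one of person name, place name, or transliterated name.<br />
The edges for m_nUnknownWords are then added to the graphOptimum from step one:<br />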
graphOptimum.GetElement(nAtomStart,nAtomEnd,&amp;dValue,&amp;nPOSOriginal);<br />
if(dValue&gt;m_roleTag.m_dWordsPossibility[i])//Set the element with less frequency<br />
&nbsp;graphOptimum.SetElement(nAtomStart,nAtomEnd,m_roleTag.m_dWordsPossibility[i],m_nPOS);</p>
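<p>The update rule in the snippet above (replace an edge only when the recognized name is cheaper) can be sketched with a plain map. WordGraph, EdgeInfo and offerEdge are hypothetical names introduced for illustration; the actual code works on a CDynamicArray through GetElement and SetElement.</p>

```cpp
#include <cassert>
#include <map>
#include <utility>

// Cost (-log P) and tag stored for the edge that spans atoms [start, end).
struct EdgeInfo { double cost; int pos; };
typedef std::map<std::pair<int, int>, EdgeInfo> WordGraph;

// Adds the edge, or overwrites an existing one only when the new cost is lower.
// Returns true when the graph was changed.
bool offerEdge(WordGraph& graph, int start, int end, double cost, int pos) {
    assert(start < end);
    const std::pair<int, int> key(start, end);
    WordGraph::iterator it = graph.find(key);
    // An existing edge that is at least as cheap wins; a missing edge always loses.
    if (it != graph.end() && it->second.cost <= cost) return false;
    EdgeInfo info = {cost, pos};
    graph[key] = info;
    return true;
}
```

<p>Because a lower cost means a higher probability, a recognized name replaces an existing word edge only when it is the more probable reading of the same atom range.</p>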
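<p>IV. Re-segmentation</p>
<p>Starting from the graphOptimum of the previous step, the method used on m_segGraph in step one finds the segmentation with the highest joint probability:<br />
m_Seg.OptimumSegmet(nCount);</p>
<p>V. Re-tagging</p>
<p>The results of step IV are POS-tagged with the standard dictionary:<br />
for(nIndex=0;nIndex&lt;m_Seg.m_nSegmentCount;nIndex++)//m_Seg.m_nSegmentCount is the number of segmentation results from step IV<br />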
{<br />
&nbsp;m_POSTagger.POSTagging(m_Seg.m_pWordSeg[nIndex],m_dictCore,m_dictCore);<br />
}</p>
<p><br />
Finally, Sort(); orders the tagged results by joint probability in descending order, and the top few are output as requested by the user.</p>
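<p>Since the scores are kept as -log values, sorting by descending probability means sorting by ascending cost. A minimal sketch follows; Candidate and topCandidates are hypothetical names, and the real Sort() works on the members of CResult.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One tagged result; cost is -log of its joint probability.
struct Candidate { int id; double cost; };

// Returns the n candidates with the lowest cost (the highest probability),
// ordered from most to least probable.
std::vector<Candidate> topCandidates(std::vector<Candidate> all, std::size_t n) {
    std::stable_sort(all.begin(), all.end(),
                     [](const Candidate& a, const Candidate& b) { return a.cost < b.cost; });
    if (all.size() > n) all.resize(n);
    return all;
}
```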
<br />
Source: http://qxred.yculblog.com/post.1204714.html
<img src ="http://www.blogjava.net/jiangyz/aggbug/171335.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 22:10 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171335.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>SharpICTCLAS 1.0 Released! (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171318.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:55:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171318.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171318.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171318.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171318.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171318.html</trackback:ping><description><![CDATA[<p><font color="#ff0000">SharpICTCLAS 1.0 released (thanks to <a href="http://www.gk-z.com/" target="_blank">工控网</a> for reporting a problem in the string comparison code; it has been fixed, so please download again. April 20, 2007)</font></p>
<ul>
    <li><a href="http://www.cnblogs.com/Files/zhenyulu/SharpICTCLAS分词系统_1.0.rar">Download SharpICTCLAS 1.0</a> </li>
</ul>
<p>　</p>
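<h3>I. Improvements in SharpICTCLAS 1.0 over the beta</h3>
<p>1. Revised the atom segmentation code so that full-width letters are recognized better</p>
<p>2. Revised part of the POS tagging code</p>
<p>The POS tagging code had a problem (probably inherited from ICTCLAS): if a Chinese character has no POS, tagging throws an exception. For example, in &#8220;这些是永远也没有现成的答桉的&#8221; the word &#8220;答案&#8221; is misspelled. When this sentence is segmented, the character &#8220;桉&#8221; has no POS, and the program fails at that point.</p>
<p>The current workaround is to tag words that have no POS as &#8220;string&#8221; in the final tagging pass.</p>
<p>3. Fixed some problems in place-name recognition</p>
<p>The problem is in the PlaceRecognize method of the Span class, where nStart and nEnd are sometimes computed incorrectly. In the SharpICTCLAS beta, segmenting the sentence &#8220;明定陵是明十三陵中第十座陵墓&#8221; threw an exception because of it. </p>
<p>4. Revised the CCID-based string comparison code</p>
<p>The original code did not handle comparisons between strings that mix full-width and half-width characters well; this is now fixed.</p>
<p>5. Revised the code that adds words to the dictionary</p>
<p>The original code contained errors, which have been corrected.</p>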
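<h3>II. Areas that still need work</h3>
<p>The program still has plenty of room for improvement. For example, the atom segmentation code does not yet recognize e-mail addresses, URLs and the like very well; regular expressions could be used to improve this later. In addition, apart from fixing some faulty code, I made no improvements or adjustments to the POS tagging and person-name and place-name recognition code, so the code as a whole looks messy and is due for a thorough refactoring.</p>
<h3>III. Sample code for using SharpICTCLAS</h3>
<p>To make SharpICTCLAS easier to use, some sample code is provided. It covers: 1) adding new words to the dictionary; 2) preprocessing files, including Traditional-to-Simplified Chinese conversion, full-width to half-width character conversion, filtering out unwanted HTML tags with regular expressions, and sentence splitting. For details, see my article "<a href="http://www.cnblogs.com/zhenyulu/articles/718375.html">Introduction to the SharpICTCLAS Segmentation System (9): Extending the Dictionary</a>".</p>
<p>After these adjustments SharpICTCLAS runs reasonably well. In a segmentation test on 15,000 articles from 博客园 (cnblogs), I first added more than 1,300 words to the dictionary. Segmentation raised an exception 15 times: 9 were caused by large amounts of Japanese text, and the other 6 by sentences with too many words, which exceeded the software limit of 200 words. Throughput is acceptable, although still slow overall: the 15,000 articles took 2.5 hours in total. That figure covers more than segmentation; it also includes Traditional-to-Simplified conversion, removing HTML markup with regular expressions, counting word frequencies (which requires detecting repeated words; I used an AVL tree and counted 160,000 distinct words), and writing the results to a SQL Server 2005 database. Leaving those factors aside, my impression is that the speed should be close to that of the C++ program, although this has not been rigorously tested.</p>
<p>If you run into new problems while using it, please let me know and I will keep fixing them.</p>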
<p>　</p>
<hr align="left" width="400" />
<p>　</p>
<ul>
    <li><font color="#800080"><strong>About ICTCLAS:</strong></font> </li>
</ul>
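<p>ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is the Chinese lexical analysis system of the Institute of Computing Technology. Its functions are Chinese word segmentation, POS tagging, and unknown-word recognition. Segmentation precision reaches 97.58% (as evaluated by the 973 Program experts), the recall of unknown-word recognition is above 90% for every category, with recall for Chinese person names close to 98%, and the processing speed is 31.5 Kbytes/s.</p>
<p>Copyright: Copyright(c)2002-2005 Institute of Computing Technology, Chinese Academy of Sciences. Work-for-hire copyright holder: 张华平</p>
<p>License: Natural Language Processing Open Resource License 1.0 (自然语言处理开放资源许可证1.0)</p>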
<p>Email: <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#122;&#104;&#97;&#110;&#103;&#104;&#112;&#64;&#115;&#111;&#102;&#116;&#119;&#97;&#114;&#101;&#46;&#105;&#99;&#116;&#46;&#97;&#99;&#46;&#99;&#110;">zhanghp@software.ict.ac.cn</a></p>
<p>Homepage: <a href="http://www.i3s.ac.cn/">http://www.i3s.ac.cn</a></p>
<p>　</p>
<ul>
    <li><strong><font color="#800080">SharpICTCLAS：</font></strong> </li>
</ul>
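<p>SharpICTCLAS is ICTCLAS for the .NET platform. It was adapted from the free version of ICTCLAS by 吕震宇 of the School of Economics and Management, Hebei Polytechnic University (河北理工大学), who also rewrote and adjusted parts of the original code.</p>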
<p>Email: <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#122;&#104;&#101;&#110;&#121;&#117;&#108;&#117;&#64;&#49;&#54;&#51;&#46;&#99;&#111;&#109;">zhenyulu@163.com</a></p>
<p>Blog: <a href="http://www.cnblogs.com/zhenyulu">http://www.cnblogs.com/zhenyulu</a><br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>
  <img src ="http://www.blogjava.net/jiangyz/aggbug/171318.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:55 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171318.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Segmentation System (9): Extending the Dictionary (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171317.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:43:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171317.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171317.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171317.html#Feedback</comments><slash:comments>3</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171317.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171317.html</trackback:ping><description><![CDATA[<h3>1. Extending the dictionary in SharpICTCLAS</h3>
<p>If the current SharpICTCLAS dictionary does not meet your needs, you can extend it. Doing so is straightforward; the code is as follows:</p>
<div class="code">
<div class="title">
<div style="clear: none">Extending the dictionary</div>
</div>
<div class="content"><span style="color: #0000ff">static</span> <span style="color: #0000ff">void</span> Main(<span style="color: #0000ff">string</span>[] args) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">string</span> DictPath = Path.Combine(Environment.CurrentDirectory, <span style="color: #ff00ff">"Data"</span>) +<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Path.DirectorySeparatorChar; <br />
&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"正在读入字典，请稍候..."</span>); <br />
<br />
&nbsp;&nbsp; WordDictionary dict = <span style="color: #0000ff">new</span> WordDictionary(); <br />
&nbsp;&nbsp; dict.Load(DictPath + <span style="color: #ff00ff">"coreDict.dct"</span>); <br />
<br />
&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n向字典库插入&#8220;设计模式&#8221;一词..."</span>); <br />
&nbsp;&nbsp; dict.AddItem(<span style="color: #ff00ff">"设计模式"</span>, Utility.GetPOSValue(<span style="color: #ff00ff">"n"</span>), 10); <br />
<br />
&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n修改完成，将字典写入磁盘文件coreDictNew.dct，请稍候..."</span>); <br />
&nbsp;&nbsp; dict.Save(DictPath + <span style="color: #ff00ff">"coreDictNew.dct"</span>); <br />
<br />
&nbsp;&nbsp; Console.Write(<span style="color: #ff00ff">"按下回车键退出......"</span>); <br />
&nbsp;&nbsp; Console.ReadLine(); <br />
}</div>
</div>
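<p>The AddItem method makes it easy to add a new word. Besides the word itself, the call must specify its POS and its frequency.</p>
<h3>2. Other tools</h3>
<p>The SharpICTCLAS sample code also includes PreProcessUtility, a utility class for preprocessing files. It provides code for converting the Traditional Chinese characters in GB2312 to Simplified Chinese and a method for converting full-width letters to half-width letters. It also provides a method that preprocesses HTML files by stripping HTML tags. Use them as needed.</p>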
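<p>Full-width to half-width conversion can be sketched as follows. This is not the code of PreProcessUtility, whose implementation is not shown here; toHalfWidth is a hypothetical helper that relies only on the Unicode layout, where the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts and U+3000 is the ideographic space.</p>

```cpp
#include <cassert>
#include <string>

// Maps full-width ASCII variants (U+FF01..U+FF5E) and the ideographic space
// (U+3000) to their half-width counterparts; every other code point is kept.
std::u32string toHalfWidth(const std::u32string& text) {
    std::u32string out;
    out.reserve(text.size());
    for (char32_t c : text) {
        if (c == 0x3000) {
            out.push_back(U' ');
        } else if (c >= 0xFF01 && c <= 0xFF5E) {
            out.push_back(static_cast<char32_t>(c - 0xFEE0));
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

<p>The sketch works on UTF-32 code points to keep the arithmetic simple; text held in GB2312 or UTF-8 would need to be decoded first.</p>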
<p>　</p>
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
<p>This concludes the series of articles on SharpICTCLAS.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>
 <img src ="http://www.blogjava.net/jiangyz/aggbug/171317.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:43 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171317.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Segmentation System (8): Miscellaneous (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171316.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:38:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171316.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171316.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171316.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171316.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171316.html</trackback:ping><description><![CDATA[<p>The previous articles covered the main parts of SharpICTCLAS. This one covers a few remaining topics: the event mechanism and how to use SharpICTCLAS.</p>
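<h3>1. Events in SharpICTCLAS</h3>
<p>Segmentation is a fairly complex process, so users may well want to trace it, and setting breakpoints in the code is tedious. SharpICTCLAS therefore provides an event mechanism that raises events at the different stages of segmentation; users can subscribe to these events and print the intermediate results for troubleshooting.</p>
<p>The stages are defined in the SegmentStage enum:</p>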
<div class="code">
<div class="title">
<div style="clear: none">The SegmentStage enum</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">enum</span> SegmentStage <br />
{ <br />
&nbsp;&nbsp; BeginSegment,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//segmentation begins </span><br />
&nbsp;&nbsp; AtomSegment,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//atom segmentation </span><br />
&nbsp;&nbsp; GenSegGraph,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//SegGraph generated </span><br />
&nbsp;&nbsp; GenBiSegGraph,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//BiSegGraph generated </span><br />
&nbsp;&nbsp; NShortPath,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//N-shortest-path computation </span><br />
&nbsp;&nbsp; BeforeOptimize,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//result of further processing the N shortest paths </span><br />
&nbsp;&nbsp; OptimumSegment,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//initial OptimumSegmentGraph </span><br />
&nbsp;&nbsp; PersonAndPlaceRecognition, <span style="color: #008000">//OptimumSegmentGraph after person-name and place-name recognition </span><br />
&nbsp;&nbsp; BiOptimumSegment,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//BiOptimumSegmentGraph generated </span><br />
&nbsp;&nbsp; FinishSegment&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//segmentation finished, result output </span><br />
}</div>
</div>
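<p>These values correspond to the 10 stages of the segmentation process.</p>
<p>SharpICTCLAS also defines an EventArgs subclass with two members: one records the segmentation stage the event belongs to, and the other carries the intermediate result of that stage. The intermediate result is currently a string; a richer representation could be considered later. The class is defined as follows:</p>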
<div class="code">
<div class="title">
<div style="clear: none">Definition of SegmentEventArgs</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> SegmentEventArgs : EventArgs <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> SegmentStage Stage; <br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> <span style="color: #0000ff">string</span> Info = <span style="color: #ff00ff">""</span>; <br />
<br />
&nbsp;&nbsp; ...... <br />
}</div>
</div>
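<p>What remains is to define the delegate and raise the events. Segmentation is concentrated in two classes, WordSegment and Segment, and users normally deal only with WordSegment, so WordSegment forwards the events raised by Segment.</p>
<p>The delegate and the event are defined as follows (excerpt):</p>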
<div class="code">
<div class="title">
<div style="clear: none">Code</div>
</div>
<div class="content"><span style="color: #008000">//--- define the delegate </span><br />
<span style="color: #0000ff">public</span> <span style="color: #0000ff">delegate</span> <span style="color: #0000ff">void</span> SegmentEventHandler(<span style="color: #0000ff">object</span> sender, SegmentEventArgs e); <br />
<br />
<span style="color: #008000">//--- define the event </span><br />
<span style="color: #0000ff">public</span> <span style="color: #0000ff">event</span> SegmentEventHandler OnSegmentEvent; <br />
<br />
<span style="color: #008000">//--- method that raises the event </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> SendEvents(SegmentEventArgs e) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">if</span> (OnSegmentEvent != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; OnSegmentEvent(<span style="color: #0000ff">this</span>, e); <br />
} <br />
<br />
<span style="color: #008000">//--- segmentation begins </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> OnBeginSegment(<span style="color: #0000ff">string</span> sentence) <br />
{ <br />
&nbsp;&nbsp; SendEvents(<span style="color: #0000ff">new</span> SegmentEventArgs(SegmentStage.BeginSegment, sentence)); <br />
} <br />
<br />
...... <br />
<br />
<span style="color: #008000">//--- segmentation finished </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> OnFinishSegment(List&lt;WordResult[]&gt; m_pWordSeg) <br />
{ <br />
&nbsp;&nbsp; StringBuilder sb = <span style="color: #0000ff">new</span> StringBuilder(); <br />
&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> k = 0; k &lt; m_pWordSeg.Count; k++) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> j = 0; j &lt; m_pWordSeg[k].Length; j++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append(<span style="color: #0000ff">string</span>.Format(<span style="color: #ff00ff">"{0} /{1} "</span>, m_pWordSeg[k][j].sWord,&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Utility.GetPOSString(m_pWordSeg[k][j].nPOS))); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sb.Append(<span style="color: #ff00ff">"\r\n"</span>); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; SendEvents(<span style="color: #0000ff">new</span> SegmentEventArgs(SegmentStage.FinishSegment, sb.ToString())); <br />
}</div>
</div>
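<p>With these events in place, users can subscribe to whichever ones they need to obtain the intermediate results of segmentation, which makes debugging much easier.</p>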
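<p>The same publish and subscribe pattern can be sketched outside .NET. The following C++ version is only an analogy under stated assumptions: Stage, SegmentEvent and Segmenter are hypothetical stand-ins for SegmentStage, SegmentEventArgs and WordSegment, only three of the ten stages are modelled, and run() emits placeholder events instead of segmenting anything.</p>

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A subset of the ten stages, enough to show the mechanism.
enum class Stage { BeginSegment, AtomSegment, FinishSegment };

// Mirrors the two members described above: the stage and a text payload.
struct SegmentEvent { Stage stage; std::string info; };

class Segmenter {
public:
    // Registers a callback that will receive every stage event.
    void subscribe(std::function<void(const SegmentEvent&)> handler) {
        handlers_.push_back(std::move(handler));
    }

    // Placeholder pipeline: raises one event per modelled stage.
    void run(const std::string& sentence) {
        send({Stage::BeginSegment, sentence});
        send({Stage::AtomSegment, "(atoms would be listed here)"});
        send({Stage::FinishSegment, "(tagged words would be listed here)"});
    }

private:
    // Delivers the event to every subscriber; does nothing when there are none.
    void send(const SegmentEvent& e) {
        for (const auto& h : handlers_) h(e);
    }

    std::vector<std::function<void(const SegmentEvent&)>> handlers_;
};
```

<p>A caller subscribes a lambda before calling run(), for example one that appends e.stage to a vector, and then inspects what was recorded. Keeping the payload a plain string matches the choice described for SegmentEventArgs.</p>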
<h3>2. Using SharpICTCLAS</h3>
<p>The following sample code shows how to use SharpICTCLAS:</p>
<div class="code">
<div class="title">
<div style="clear: none">WordSegmentSample.cs</div>
</div>
<div class="content"><span style="color: #0000ff">using</span> System; <br />
<span style="color: #0000ff">using</span> System.Collections.Generic; <br />
<span style="color: #0000ff">using</span> System.Text; <br />
<strong><span style="color: #0000ff">using</span> SharpICTCLAS; </strong><br />
<br />
<span style="color: #0000ff">public</span> <span style="color: #0000ff">class</span> WordSegmentSample <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">private</span> <span style="color: #0000ff">int</span> nKind = 1;&nbsp; <span style="color: #008000">//在NShortPath方法中用来决定初步切分时分成几种结果 </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">private</span> WordSegment wordSegment; <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #008000">// 构造函数，在没有指明nKind的情况下，nKind 取 1 </span><br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> WordSegmentSample(<span style="color: #0000ff">string</span> dictPath) : <span style="color: #0000ff">this</span>(dictPath, 1) { } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #008000">// 构造函数 </span><br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> WordSegmentSample(<span style="color: #0000ff">string</span> dictPath, <span style="color: #0000ff">int</span> nKind) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">this</span>.nKind = nKind; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong><span style="color: #0000ff">this</span>.wordSegment = <span style="color: #0000ff">new</span> WordSegment(); </strong><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//---------- 订阅分词过程中的事件 ---------- </span><br />
<strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wordSegment.OnSegmentEvent += <span style="color: #0000ff">new</span> SegmentEventHandler(<span style="color: #0000ff">this</span>.OnSegmentEventHandler); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wordSegment.InitWordSegment(dictPath); </strong><br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #008000">// 开始分词 </span><br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">public</span> List&lt;WordResult[]&gt; Segment(<span style="color: #0000ff">string</span> sentence) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong><span style="color: #0000ff">return</span> wordSegment.Segment(sentence, nKind); </strong><br />
&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #008000">// 输出分词过程中每一步的中间结果 </span><br />
&nbsp;&nbsp; <span style="color: #008000">//======================================================= </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> OnSegmentEventHandler(<span style="color: #0000ff">object</span> sender, SegmentEventArgs e) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">switch</span> (e.Stage) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.BeginSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 原始句子：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info + <span style="color: #ff00ff">"\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.AtomSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 原子切分：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.GenSegGraph: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 生成 segGraph：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.GenBiSegGraph: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 生成 biSegGraph：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.NShortPath: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== NShortPath 初步切分的到的 N 个结果：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.BeforeOptimize: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 经过数字、日期合并等策略处理后的 N 个结果：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.OptimumSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 将 N 个结果归并入OptimumSegment：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.PersonAndPlaceRecognition: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 加入对姓名、翻译人名以及地名的识别：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.BiOptimumSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 对加入对姓名、地名的OptimumSegment生成BiOptimumSegment：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">case</span> SegmentStage.FinishSegment: <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"\r\n==== 最终识别结果：\r\n"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(e.Info); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp; } <br />
} <br />
</div>
</div>
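<p>As the code shows, the first step is to reference the SharpICTCLAS namespace and create an instance of the WordSegment class. To intercept the events raised during segmentation, subscribe to the OnSegmentEvent event of WordSegment. The code above subscribes with the OnSegmentEventHandler method and prints the intermediate result of every stage.</p>
<p>The nKind field of WordSegmentSample is used by NShortPath to decide how many candidate results the initial segmentation produces. It defaults to 1. Users can set it to any integer from 1 to 10 (values above 10 are capped at 10). Larger values give higher segmentation accuracy (see 张华平's paper) at the price of slower execution.</p>
<p>The InitWordSegment method of WordSegment initializes the dictionaries. The caller supplies the directory that contains them, and the system loads all dictionaries found there.</p>
<p>With the WordSegmentSample class in place, the main program looks like this:</p>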
<div class="code">
<div class="title">
<div style="clear: none">Code</div>
</div>
<div class="content"><span style="color: #0000ff">using</span> System; <br />
<span style="color: #0000ff">using</span> System.Collections.Generic; <br />
<span style="color: #0000ff">using</span> System.Text; <br />
<span style="color: #0000ff">using</span> System.IO; <br />
<span style="color: #0000ff">using</span> SharpICTCLAS; <br />
<br />
<span style="color: #0000ff">class</span> Program <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">static</span> <span style="color: #0000ff">void</span> Main(<span style="color: #0000ff">string</span>[] args) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; List&lt;WordResult[]&gt; result; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">string</span> DictPath = Path.Combine(Environment.CurrentDirectory, <span style="color: #ff00ff">"Data"</span>) +&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Path.DirectorySeparatorChar; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.WriteLine(<span style="color: #ff00ff">"正在初始化字典库，请稍候..."</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WordSegmentSample sample = <span style="color: #0000ff">new</span> WordSegmentSample(DictPath, 5); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; result = sample.Segment(@<span style="color: #ff00ff">"王晓平在1月份滦南大会上说的确实在理"</span>); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//---------- 输出结果 ---------- </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Console.WriteLine("\r\n==== 最终识别结果：\r\n"); </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//for (int i = 0; i &lt; result.Count; i++) </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//{ </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//&nbsp;&nbsp; for (int j = 0; j &lt; result[i].Length; j++) </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.Write("{0} /{1} ", result[i][j].sWord, Utility.GetPOSString(result[i][j].nPOS)); </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//&nbsp;&nbsp; Console.WriteLine(); </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//} </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.Write(<span style="color: #ff00ff">"按下回车键退出......"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Console.ReadLine(); <br />
&nbsp;&nbsp; } <br />
} <br />
</div>
</div>
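<p>The code is straightforward, so I won't go into detail. Because WordSegmentSample subscribes to the events of every stage, the program prints the intermediate results of each stage as well as the final segmentation result; that is why the result-printing code above is commented out. If no events are subscribed, the commented-out code can be used to print the final segmentation result.</p>
<p>The program produces the following output:</p>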
<div class="code">
<div class="title">
<div style="clear: none">WordSegmentSample program output</div>
</div>
<div class="content">正在初始化字典库，请稍候... <br />
<br />
<span style="color: #008000">//==== 原始句子： </span><br />
<br />
王晓平在1月份滦南大会上说的确实在理 <br />
<br />
<br />
<span style="color: #008000">//==== 原子切分： </span><br />
<br />
始##始, 王, 晓, 平, 在, 1, 月, 份, 滦, 南, 大, 会, 上, 说, 的, 确, 实, 在, 理, 末##末,&nbsp; <br />
<br />
<br />
<span style="color: #008000">//==== 生成 segGraph： </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 218.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 271.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp; 1900.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:月 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 1234.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.00,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 813.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp; 14536.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp; 1333.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会 <br />
row: 11,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp; 6136.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 469.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp; 23706.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:上 <br />
row: 13,&nbsp; col: 14,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 14,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 210.00,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确 <br />
row: 15,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 181.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row: 16,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 357.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实 <br />
row: 16,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 295.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实在 <br />
row: 17,&nbsp; col: 18,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row: 18,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row: 19,&nbsp; col: 20,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
<br />
<br />
<span style="color: #008000">//==== 生成 biSegGraph： </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.18,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始@王 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.46,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王@晓 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.93,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓@平 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.25,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平@在 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.74,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@未##数 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight: -27898.79,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@月 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight: -27898.75,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@月份 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.33,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:月@份 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.83,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份@滦 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@滦 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.46,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦@南 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.19,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南@大 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.17,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南@大会 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会 <br />
row: 11,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会上 <br />
row: 12,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.11,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会@上 <br />
row: 13,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 8.16,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会@上 <br />
row: 14,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.42,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上@说 <br />
row: 15,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.07,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:上@说 <br />
row: 16,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.05,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的 <br />
row: 16,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.11,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的确 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确 <br />
row: 17,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确实 <br />
row: 18,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.49,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确@实 <br />
row: 19,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.63,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确@实 <br />
row: 18,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.49,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确@实在 <br />
row: 19,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.63,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确@实在 <br />
row: 20,&nbsp; col: 23,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.92,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在 <br />
row: 21,&nbsp; col: 23,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.98,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实@在 <br />
row: 20,&nbsp; col: 24,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.97,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在理 <br />
row: 21,&nbsp; col: 24,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.98,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实@在理 <br />
row: 22,&nbsp; col: 25,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.17,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实在@理 <br />
row: 23,&nbsp; col: 25,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.62,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@理 <br />
row: 24,&nbsp; col: 26,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.30,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理@末##末 <br />
row: 25,&nbsp; col: 26,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.95,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理@末##末 <br />
<br />
<br />
<span style="color: #008000">//==== NShortPath 初步切分的到的 N 个结果： </span><br />
<br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大会, 上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大, 会上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大会, 上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月, 份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
<br />
<br />
<span style="color: #008000">//==== 经过数字、日期合并等策略处理后的 N 个结果： </span><br />
<br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大会, 上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大会, 上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月, 份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
<br />
<br />
<span style="color: #008000">//==== 加入对姓名、翻译人名以及地名的识别： </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 218.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.86,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.27,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 271.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 1234.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.00,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦 <br />
row:&nbsp; 8,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 20.37,&nbsp;&nbsp; nPOS: -28275,&nbsp;&nbsp; sWord:未##地 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 813.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp; 14536.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp; 1333.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 469.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp; 23706.00,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数 <br />
row: 13,&nbsp; col: 14,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row: 17,&nbsp; col: 18,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row: 18,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row: 19,&nbsp; col: 20,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
<br />
<br />
<span style="color: #008000">//==== 生成 biSegGraph： </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.18,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始@王 <br />
row:&nbsp; 0,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.88,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始@未##人 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.46,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王@晓 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.88,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王@未##人 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.93,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓@平 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 6,&nbsp; eWeight: -28270.43,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人@在 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 6,&nbsp; eWeight: -28270.43,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人@在 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.25,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平@在 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.01,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@未##时 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.01,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@未##时 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight: -29690.16,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时@份 <br />
row:&nbsp; 8,&nbsp; col: 10,&nbsp; eWeight: -29690.16,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时@滦 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@滦 <br />
row:&nbsp; 8,&nbsp; col: 11,&nbsp; eWeight: -29690.17,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时@未##地 <br />
row:&nbsp; 9,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@未##地 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.46,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦@南 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight: -28267.95,&nbsp;&nbsp; nPOS: -28275,&nbsp;&nbsp; sWord:未##地@大 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.19,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南@大 <br />
row: 11,&nbsp; col: 14,&nbsp; eWeight: -28266.85,&nbsp;&nbsp; nPOS: -28275,&nbsp;&nbsp; sWord:未##地@大会 <br />
row: 12,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.17,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南@大会 <br />
row: 13,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会上 <br />
row: 14,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.81,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会@未##数 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.42,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上@说 <br />
row: 16,&nbsp; col: 17,&nbsp; eWeight: -27898.75,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@说 <br />
row: 17,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.05,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的 <br />
row: 18,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确实 <br />
row: 19,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.92,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在 <br />
row: 19,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.97,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在理 <br />
row: 20,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.62,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@理 <br />
row: 21,&nbsp; col: 23,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.30,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理@末##末 <br />
row: 22,&nbsp; col: 23,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.95,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理@末##末 <br />
<br />
<br />
<span style="color: #008000">//==== 最终识别结果： </span><br />
<br />
王晓平 /nr 在 /p&nbsp; 1月份 /t&nbsp; 滦南 /ns 大会 /n&nbsp; 上 /v&nbsp; 说 /v&nbsp; 的 /uj 确实 /ad 在 /p&nbsp; 理 /n </div>
</div>
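<p>The nPOS column in the output above looks like a packed tag code rather than an arbitrary number: 28160 is 110 × 256, and 110 is the character code of 'n', which matches the /n tag on 大会 in the final result. The sketch below decodes such values on that assumption. PosCodeSketch and Decode are illustrative names, not part of SharpICTCLAS, and this is only a guess at what Utility.GetPOSString does, not its actual implementation.</p>

```csharp
using System;

// Hypothetical helper, not part of SharpICTCLAS.
public static class PosCodeSketch
{
    // Assumption: high byte = first tag letter, low byte = optional second letter.
    // The sign is dropped first; in the output above, negative codes appear on the
    // 未##… placeholder entries.
    public static string Decode(int nPOS)
    {
        int code = Math.Abs(nPOS);
        if (code < 256)
            return code.ToString();   // small codes such as 0, 1 and 4 are not letter pairs

        char first = (char)(code / 256);
        int low = code % 256;
        return low == 0 ? first.ToString() : first.ToString() + (char)low;
    }
}
```

<p>Under this reading, -28274, -29696 and -28275 decode to nr, t and ns, which agrees with the tags on 王晓平, 1月份 and 滦南 in the final result.</p>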
<p>　</p>
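<p><font color="#ff0000">I am very pleased that, just as this final article was completed, I received authorization from Mr. Zhang Huaping (张华平). I will upload the SharpICTCLAS source files as soon as I can so that everyone can test and use them.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</font></p>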
  <img src ="http://www.blogjava.net/jiangyz/aggbug/171316.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:38 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171316.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (7): OptimumSegment (repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171314.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:34:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171314.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171314.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171314.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171314.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171314.html</trackback:ping><description><![CDATA[<p>The previous article showed that the NShortPath computation yields several candidate segmentations. How are these candidates reduced to a single final result? That step is carried out by OptimumSegment. The OptimumSegment process in SharpICTCLAS is essentially the same as in ICTCLAS, with no major changes.</p>
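<h3>1. How OptimumSegment works</h3>
<p>The multiple results produced by NShortPath first pass through the date-merging strategy, which is what the GenerateWord method described in the previous article does. Inside GenerateWord you can see the following statement:</p>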
<p><code>&nbsp;m_graphOptimum.SetElement(pCur.row, pCur.col, ......);</code></p>
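<p>Its job is to merge all of the candidate segmentations into a single <code> m_graphOptimum </code> property. For the NShortPath results below, for example, the <code> m_graphOptimum </code> property contains every word shown in red once the merge is done.</p>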
<div class="code">
<div class="title">
<div style="clear: none">Preliminary segmentation results after NShortPath processing</div>
</div>
<div class="content"><span style="color: #008000">//==== 原始句子： </span><br />
<br />
王晓平在1月份滦南大会上说的确实在理 <br />
<br />
<span style="color: #008000">//==== NShortPath 初步切分的到的 N 个结果： </span><br />
<br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大会, 上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大, 会上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月份, 滦, 南, 大会, 上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1, 月, 份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
<br />
<br />
<span style="color: #008000">//==== 经过数字、日期合并等策略处理后的 N 个结果： </span><br />
<br />
<font color="#ff0000"><strong>始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,</strong>&nbsp; </font><br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, <strong><font color="#ff0000">大会</font></strong>, 上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, <strong><font color="#ff0000">在理</font></strong>, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大会, 上, 说, 的, 确实, 在理, 末##末,&nbsp; <br />
始##始, 王, 晓, 平, 在, <strong><font color="#ff0000">1月</font></strong>, <strong><font color="#ff0000">份</font></strong>, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
</div>
</div>
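<p>As a rough illustration of the merge, the sketch below collects the distinct (row, col, word) entries used by any candidate. It assumes that each sentinel (始##始, 末##末) and each character counts as one atom, which holds for this sentence. LatticeMergeSketch and Merge are hypothetical names, and the real code stores the entries through m_graphOptimum.SetElement rather than in a HashSet.</p>

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SharpICTCLAS.
public static class LatticeMergeSketch
{
    // Returns the distinct (row, col, word) entries used by any candidate segmentation.
    public static HashSet<(int Row, int Col, string Word)> Merge(IEnumerable<string[]> candidates)
    {
        var edges = new HashSet<(int Row, int Col, string Word)>();
        foreach (string[] candidate in candidates)
        {
            int row = 0;
            foreach (string word in candidate)
            {
                // Assumption: a sentinel spans one atom, any other word one atom per character.
                bool isSentinel = word == "始##始" || word == "末##末";
                int col = row + (isSentinel ? 1 : word.Length);
                edges.Add((row, col, word));   // the set silently drops duplicates
                row = col;
            }
        }
        return edges;
    }
}
```

<p>Applied to the five merged candidates above, this yields 21 distinct entries, including (5, 7, 1月), (5, 8, 1月份), (10, 12, 大会) and (17, 19, 在理).</p>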
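<p>Next, person-name and place-name recognition is run on the merged <code> m_graphOptimum </code> to find every possible person-name and place-name candidate. After recognition, <code> m_graphOptimum </code> looks like this:</p>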
<div class="code">
<div class="title">
<div style="clear: none">After adding person-name and place-name recognition</div>
</div>
<div class="content"><span style="color: #008000">//==== 加入对姓名、翻译人名以及地名的识别： </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 218.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:王 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.86,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:晓 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.27,&nbsp;&nbsp; nPOS: -28274,&nbsp;&nbsp; sWord:未##人 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 271.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:平 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -29696,&nbsp;&nbsp; sWord:未##时 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 1234.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.00,&nbsp;&nbsp; nPOS:&nbsp; 27136,&nbsp;&nbsp; sWord:滦 <br />
row:&nbsp; 8,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 20.37,&nbsp;&nbsp; nPOS: -28275,&nbsp;&nbsp; sWord:未##地 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 813.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:南 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight:&nbsp; 14536.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp; 1333.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 469.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp; 23706.00,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数 <br />
row: 13,&nbsp; col: 14,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row: 17,&nbsp; col: 18,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row: 18,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row: 19,&nbsp; col: 20,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
</div>
</div>
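<p>At this point <code> m_graphOptimum </code> contains every element that could appear in the final segmentation (person names, place names, and all word candidates that survived the NShortPath filtering). The Segment class then runs NShortPath on <code> m_graphOptimum </code> once more and takes the best path as the final segmentation.</p>
<p>The whole process can be seen in the Segment method of the WordSegment class, which SharpICTCLAS defines as follows (simplified):</p>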
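<p>This final pass is a shortest-path search over the lattice. The sketch below is a plain single-best dynamic-programming search, not the NShortPath class itself, and it puts one cost on each word, whereas the real system weighs pairs of adjacent words (the biSegGraph). BestPathSketch and BestPath are made-up names for illustration.</p>

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SharpICTCLAS.
public static class BestPathSketch
{
    // Finds the cheapest path from node 0 to node nodeCount - 1.
    // Every edge runs from a lower to a higher node index, so handling edges in
    // order of their start node is a valid topological order.
    public static List<string> BestPath(int nodeCount,
        IEnumerable<(int Row, int Col, double Cost, string Word)> edges)
    {
        var sorted = new List<(int Row, int Col, double Cost, string Word)>(edges);
        sorted.Sort((a, b) => a.Row.CompareTo(b.Row));

        var best = new double[nodeCount];   // cheapest known cost to reach each node
        var prev = new int[nodeCount];      // predecessor node on that cheapest path
        var word = new string[nodeCount];   // word on the edge that enters the node
        for (int i = 1; i < nodeCount; i++)
            best[i] = double.PositiveInfinity;

        foreach (var e in sorted)
        {
            double cost = best[e.Row] + e.Cost;
            if (cost < best[e.Col])
            {
                best[e.Col] = cost;
                prev[e.Col] = e.Row;
                word[e.Col] = e.Word;
            }
        }

        // Walk back from the end node to node 0, then reverse.
        var path = new List<string>();
        for (int node = nodeCount - 1; node > 0; node = prev[node])
        {
            if (word[node] == null)
                throw new InvalidOperationException("The end node is unreachable.");
            path.Add(word[node]);
        }
        path.Reverse();
        return path;
    }
}
```

<p>With invented costs, a four-node lattice with edges a (0→1, 1.0), b (1→2, 1.0), c (2→3, 1.0), bc (1→3, 1.5) and ab (0→2, 3.0) yields the path a, bc at a total cost of 2.5.</p>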
<div class="code">
<div class="title">
<div style="clear: none">Main segmentation routine</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> List&lt;WordResult[]&gt; Segment(<span style="color: #0000ff">string</span> sentence, <span style="color: #0000ff">int</span> nKind) <br />
{ <br />
&nbsp;&nbsp; m_pNewSentence = Predefine.SENTENCE_BEGIN + sentence + Predefine.SENTENCE_END; <br />
&nbsp;&nbsp; <span style="color: #008000">//--- Preliminary segmentation </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> nResultCount = m_Seg.BiSegment(m_pNewSentence, m_dSmoothingPara, nKind); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Person-name and place-name recognition </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> i = 0; i &lt; nResultCount; i++) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_uPerson.Recognition(m_Seg.m_pWordSeg[i], m_Seg.m_graphOptimum, ......); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_uTransPerson.Recognition(m_Seg.m_pWordSeg[i], m_Seg.m_graphOptimum, ......); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_uPlace.Recognition(m_Seg.m_pWordSeg[i], m_Seg.m_graphOptimum, ......); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Final optimization </span><br />
&nbsp;&nbsp; m_Seg.BiOptimumSegment(1, m_dSmoothingPara); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//--- Part-of-speech tagging </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> i = 0; i &lt; m_Seg.m_pWordSeg.Count; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_POSTagger.POSTagging(m_Seg.m_pWordSeg[i], m_dictCore, m_dictCore); <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> m_Seg.m_pWordSeg; <br />
}</div>
</div>
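<h3>2. Recognizing person names and place names</h3>
<p>ICTCLAS recognizes person names by template matching: it first computes tags for each of the preliminary segmentation results and then matches the person-name patterns against the resulting tag string. The process runs as follows:</p>
<p>First, the person-name matching templates are defined:</p>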
<div class="code">
<div class="title">
<div style="clear: none">Person-name recognition templates</div>
</div>
<div class="content"><span style="color: #0000ff">string</span>[] sPatterns = { <span style="color: #ff00ff">"BBCD"</span>, <span style="color: #ff00ff">"BBC"</span>, <span style="color: #ff00ff">"BBE"</span>, <span style="color: #ff00ff">"BBZ"</span>, <span style="color: #ff00ff">"BCD"</span>, <br />
<span style="color: #ff00ff">"BEE"</span>, <span style="color: #ff00ff">"BE"</span>, <span style="color: #ff00ff">"BG"</span>, <span style="color: #ff00ff">"BXD"</span>, <span style="color: #ff00ff">"BZ"</span>, <span style="color: #ff00ff">"CDCD"</span>, <span style="color: #ff00ff">"CD"</span>, <span style="color: #ff00ff">"EE"</span>, <span style="color: #ff00ff">"FB"</span>,&nbsp; <br />
<span style="color: #ff00ff">"Y"</span>, <span style="color: #ff00ff">"XD"</span>, <span style="color: #ff00ff">""</span> }; <br />
<span style="color: #008000">/*------------------------------------ <br />
The person recognition patterns set <br />
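BBCD: surname + surname + given-name character 1 + given-name character 2; <br />
BBE: surname + surname + single-character given name; <br />
BBZ: surname + surname + two-character given name that forms a word; <br />
BCD: surname + given-name character 1 + given-name character 2; <br />
BE:&nbsp; surname + single-character given name; <br />
BEE: surname + single-character name + single-character name, e.g. 韩磊磊 <br />
BG:&nbsp; surname + suffix <br />
BXD: surname + (surname and first given-name character forming a word) + last given-name character <br />
BZ:&nbsp; surname + two-character given name that forms a word; <br />
B:&nbsp;&nbsp; surname <br />
CD:&nbsp; given-name character 1 + given-name character 2; <br />
EE:&nbsp; single-character name + single-character name; <br />
FB:&nbsp; prefix + surname <br />
XD:&nbsp; (surname and first given-name character forming a word) + last given-name character <br />
Y:&nbsp;&nbsp; surname and single-character given name forming a word <br />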
------------------------------------*/</span></div>
</div>
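<p>The preliminary segmentation results are then tagged, unneeded information is cleaned out, and the templates are matched to obtain the person names:</p>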
<div class="code">
<div class="title">
<div style="clear: none">人名识别过程</div>
</div>
<div class="content"><span style="color: #008000">//==== A result set after preliminary segmentation </span><br />
<br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
<br />
<span style="color: #008000">//==== m_nBestTag obtained by computation </span><br />
<br />
始##始, 王, 晓, 平, 在, 1月份, 滦, 南, 大, 会上, 说, 的, 确实, 在, 理, 末##末,&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff0000">B&nbsp;&nbsp; C&nbsp;&nbsp; D</font>&nbsp;&nbsp; M&nbsp;&nbsp; A&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp;&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp; A&nbsp;&nbsp;&nbsp;&nbsp; A&nbsp;&nbsp; A <br />
<br />
<span style="color: #008000">//==== Person name recognized after template matching </span><br />
<br />
王晓平</div>
</div>
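<p>The template-matching step can be sketched roughly as follows. This is a minimal C++ illustration, not the SharpICTCLAS code: it assumes one single-letter role tag per token, tries the patterns in array order at each position and takes the first hit. The name <code>FindNames</code> is hypothetical, and the tags of the 始##始 / 末##末 sentinels are left out because the listing above does not show them.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Patterns copied from the sPatterns array above (the empty terminator is dropped).
static const std::vector<std::string> kPatterns = {
    "BBCD", "BBC", "BBE", "BBZ", "BCD", "BEE", "BE", "BG",
    "BXD",  "BZ",  "CDCD", "CD", "EE",  "FB",  "Y",  "XD"};

// Scan the role-tag string from left to right. At each position the patterns
// are tried in array order and the first one that matches wins; the tokens it
// covers are concatenated into one candidate name.
std::vector<std::string> FindNames(const std::vector<std::string>& words,
                                   const std::string& tags) {
  assert(words.size() == tags.size());  // one tag letter per token
  std::vector<std::string> names;
  std::size_t i = 0;
  while (i < tags.size()) {
    std::size_t matched = 0;
    for (const std::string& p : kPatterns) {
      if (tags.compare(i, p.size(), p) == 0) {
        matched = p.size();
        break;
      }
    }
    if (matched == 0) {  // no pattern starts here, move on
      ++i;
      continue;
    }
    std::string name;
    for (std::size_t k = i; k < i + matched; ++k) name += words[k];
    names.push_back(name);
    i += matched;
  }
  return names;
}
```

<p>Called with the fourteen tokens from 王 to 理 and the tag string <code>BCDMAAAAAAAAAA</code>, the sketch matches only the <code>BCD</code> pattern at the start of the sentence.</p>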
<p>Place-name recognition works in a similar way, so I will not go into it here. For more on person-name and place-name recognition, see <a href="http://qxred.yculblog.com/post.1204714.html">http://qxred.yculblog.com/post.1204714.html</a>: 《ICTCLAS 中科院分词系统 代码 注释 中文分词 词性标注》 by 风暴红QxRed.</p>
<p>　</p>
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
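<p>The multiple preliminary segmentation results produced by NShortPath are merged into<code> m_graphOptimum </code>; the person-name and place-name recognition steps then add every candidate person name and place name to it as well; finally, the OptimumSegment method produces the final segmentation result.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>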
 <img src ="http://www.blogjava.net/jiangyz/aggbug/171314.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:34 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171314.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (6): Segment (repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171311.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:18:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171311.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171311.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171311.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171311.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171311.html</trackback:ping><description><![CDATA[<p>DynamicArray and NShortPath are foundational classes in ICTCLAS. Once I had finished reworking them, I started porting and reworking the Segment word-segmentation code. The changes made in SharpICTCLAS fall mainly into the following areas:</p>
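<p><font color="#800080"><strong>1) Merged code that was scattered across different classes</strong></font></p>
<p>The original ICTCLAS uses two classes, SegGraph and Segment, to carry out segmentation: SegGraph performs atom segmentation and builds segGraph, while Segment builds BiSegGraph and runs the NShortPath optimization. The final person-name and place-name recognition and the Optimum optimization are in turn spread across the Segment and WordSegment classes.</p>
<p>SharpICTCLAS merges the original SegGraph and Segment classes into one, because the work they do amounts to just a few steps of the segmentation process. The WordSegment class is kept largely as it was, because it mostly handles peripheral work.</p>
<p><font color="#800080"><strong>2) Reworked some of the data structures used in the program</strong></font></p>
<p>The original ICTCLAS makes heavy use of arrays and two-dimensional arrays. Because of the inherent limitations of fixed-size arrays, definitions like the following appear all over the code:</p>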
<p><code>m_pWordSeg = new PWORD_RESULT[MAX_SEGMENT_NUM];</code></p>
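<p>Because the number of words in the final result is not known in advance, the array can only be preallocated with the maximum capacity<code> MAX_SEGMENT_NUM </code>, so any abnormal input that exceeds this capacity causes an &#8220;overflow&#8221; error.</p>
<p>SharpICTCLAS instead records results in<code> List&lt;int[]&gt; </code>throughout. The generic List lets the number of results grow dynamically without being declared up front, and the arrays holding the individual results can each have a different length.</p>
<p>A further change is that the Segment class processes results in a linked-list structure, which greatly reduces the problems caused by the array structures in the original ICTCLAS.</p>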
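<p>For readers more at home in the C++ of the original ICTCLAS, the same idea can be sketched with <code>std::vector</code>. This is only an illustration of a growable result store with per-result lengths; the class name <code>PathStore</code> is hypothetical and does not appear in either code base.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Collects segmentation paths without a MAX_SEGMENT_NUM-style capacity:
// the outer vector grows as paths are added, and every path keeps its own length.
class PathStore {
 public:
  void Add(const std::vector<int>& path) { paths_.push_back(path); }

  std::size_t Count() const { return paths_.size(); }

  const std::vector<int>& At(std::size_t i) const {
    assert(i < paths_.size());  // bounds are checked instead of silently overflowing
    return paths_[i];
  }

 private:
  std::vector<std::vector<int>> paths_;
};
```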
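<p><font color="#800080"><strong>3) Extensive use of static methods</strong></font></p>
<p>Some procedures do not need an object at all because they only perform routine calculations, so declaring them static is the better fit; a static call is also, if anything, slightly cheaper than an instance call. When porting ICTCLAS to C#, I therefore declared as many methods as possible static.</p>
<p>Below I go through the main parts of the Segment class in SharpICTCLAS:</p>
<h3>1. Main body</h3>
<p>A fairly typical processing sequence can be seen in the BiSegment method; the (simplified) code is as follows:</p>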
<div class="code">
<div class="title">
<div style="clear: none">The BiSegment method of the Segment class</div>
</div>
<div class="content"><span style="color: #0000ff">public</span> <span style="color: #0000ff">int</span> BiSegment(<span style="color: #0000ff">string</span> sSentence, <span style="color: #0000ff">double</span> smoothPara, <span style="color: #0000ff">int</span> nKind) <br />
{ <br />
&nbsp;&nbsp; WordResult[] tmpResult; <br />
&nbsp;&nbsp; WordLinkedArray linkedArray; <br />
&nbsp;&nbsp; m_pWordSeg = <span style="color: #0000ff">new</span> List&lt;WordResult[]&gt;(); <br />
&nbsp;&nbsp; m_graphOptimum = <span style="color: #0000ff">new</span> RowFirstDynamicArray&lt;ChainContent&gt;(); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//---Atom segmentation </span><br />
&nbsp;&nbsp; <font color="#ff0000">atomSegment = AtomSegment(sSentence); </font><br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//---Look up the dictionary, add every possible word candidate and store them in a linked structure </span><br />
&nbsp;&nbsp; segGraph = GenerateWordNet(atomSegment, coreDict); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//---Find all possible pairwise combinations of adjacent candidates </span><br />
&nbsp;&nbsp; biGraphResult = BiGraphGenerate(segGraph, smoothPara, biDict, coreDict); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//---N-shortest-path computation yields several candidate segmentations </span><br />
&nbsp;&nbsp; NShortPath.Calculate(biGraphResult, nKind); <br />
&nbsp;&nbsp; List&lt;<span style="color: #0000ff">int</span>[]&gt; spResult = NShortPath.GetNPaths(Predefine.MAX_SEGMENT_NUM); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//---Optimize the results, e.g. merge dates </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> i = 0; i &lt; spResult.Count; i++) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font color="#ff0000">linkedArray = BiPath2LinkedArray(spResult[i], segGraph, atomSegment); </font><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; tmpResult = GenerateWord(spResult[i], linkedArray, m_graphOptimum); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (tmpResult != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg.Add(tmpResult); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> m_pWordSeg.Count; <br />
}</div>
</div>
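<p>As the code above shows, the atom segmentation functionality of the original ICTCLAS has been merged into the Segment class.</p>
<p>Take the sentence &#8220;<font color="#0000ff">他在1月份大会上说的确实在理</font>&#8221; (roughly, &#8220;what he said at the January conference really does make sense&#8221;) as an example. The steps above produce the following intermediate results:</p>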
<div class="code">
<div class="title">
<div style="clear: none">Intermediate results</div>
</div>
<div class="content"><span style="color: #008000">//==== Original sentence: </span><br />
<br />
他在1月份大会上说的确实在理 <br />
<br />
<br />
<span style="color: #008000">//==== Atom segmentation: </span><br />
<br />
始##始, 他, 在, 1, 月, 份, 大, 会, 上, 说, 的, 确, 实, 在, 理, 末##末, <br />
<br />
<br />
<span style="color: #008000">//==== Generated segGraph: </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight: 329805.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp; 19823.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:他 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.00,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 5,&nbsp; eWeight:&nbsp;&nbsp; 1900.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:月 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp; 1234.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp; 14536.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 1333.00,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp; 6136.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 469.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上 <br />
row:&nbsp; 8,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp; 23706.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:上 <br />
row:&nbsp; 9,&nbsp; col: 10,&nbsp; eWeight:&nbsp; 17649.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说 <br />
row: 10,&nbsp; col: 11,&nbsp; eWeight: 358156.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 210.00,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确 <br />
row: 11,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 181.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确 <br />
row: 11,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 361.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 357.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实 <br />
row: 12,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 295.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实在 <br />
row: 13,&nbsp; col: 14,&nbsp; eWeight:&nbsp; 78484.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在 <br />
row: 13,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.00,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理 <br />
row: 14,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp; 129.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理 <br />
row: 15,&nbsp; col: 16,&nbsp; eWeight:2079997.00,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4,&nbsp;&nbsp; sWord:末##末 <br />
<br />
<br />
<span style="color: #008000">//==== Generated biSegGraph: </span><br />
<br />
row:&nbsp; 0,&nbsp; col:&nbsp; 1,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.37,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1,&nbsp;&nbsp; sWord:始##始@他 <br />
row:&nbsp; 1,&nbsp; col:&nbsp; 2,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.37,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:他@在 <br />
row:&nbsp; 2,&nbsp; col:&nbsp; 3,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.74,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@未##数 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 4,&nbsp; eWeight: -27898.79,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@月 <br />
row:&nbsp; 3,&nbsp; col:&nbsp; 5,&nbsp; eWeight: -27898.75,&nbsp;&nbsp; nPOS: -27904,&nbsp;&nbsp; sWord:未##数@月份 <br />
row:&nbsp; 4,&nbsp; col:&nbsp; 6,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.33,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:月@份 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.83,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份@大 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 7,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@大 <br />
row:&nbsp; 5,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 13.83,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:月份@大会 <br />
row:&nbsp; 6,&nbsp; col:&nbsp; 8,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 9.76,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:份@大会 <br />
row:&nbsp; 7,&nbsp; col:&nbsp; 9,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会 <br />
row:&nbsp; 7,&nbsp; col: 10,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.30,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:大@会上 <br />
row:&nbsp; 8,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.11,&nbsp;&nbsp; nPOS:&nbsp; 28160,&nbsp;&nbsp; sWord:大会@上 <br />
row:&nbsp; 9,&nbsp; col: 11,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 8.16,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会@上 <br />
row: 10,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.42,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:会上@说 <br />
row: 11,&nbsp; col: 12,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.07,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:上@说 <br />
row: 12,&nbsp; col: 13,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.05,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的 <br />
row: 12,&nbsp; col: 14,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7.11,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:说@的确 <br />
row: 13,&nbsp; col: 15,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确 <br />
row: 13,&nbsp; col: 16,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.10,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:的@确实 <br />
row: 14,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.49,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确@实 <br />
row: 15,&nbsp; col: 17,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.63,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确@实 <br />
row: 14,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.49,&nbsp;&nbsp; nPOS:&nbsp; 25600,&nbsp;&nbsp; sWord:的确@实在 <br />
row: 15,&nbsp; col: 18,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.63,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确@实在 <br />
row: 16,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.92,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在 <br />
row: 17,&nbsp; col: 19,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.98,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实@在 <br />
row: 16,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.97,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:确实@在理 <br />
row: 17,&nbsp; col: 20,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 10.98,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实@在理 <br />
row: 18,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.17,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:实在@理 <br />
row: 19,&nbsp; col: 21,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.62,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:在@理 <br />
row: 20,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 14.30,&nbsp;&nbsp; nPOS:&nbsp; 24832,&nbsp;&nbsp; sWord:在理@末##末 <br />
row: 21,&nbsp; col: 22,&nbsp; eWeight:&nbsp;&nbsp;&nbsp;&nbsp; 11.95,&nbsp;&nbsp; nPOS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0,&nbsp;&nbsp; sWord:理@末##末 <br />
<br />
<br />
<span style="color: #008000">//==== The N preliminary segmentations produced by NShortPath: </span><br />
<br />
始##始, 他, 在, 1, 月份, 大会, 上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, 1, 月份, 大会, 上, 说, 的, 确实, 在理, 末##末, <br />
始##始, 他, 在, 1, 月份, 大, 会上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, 1, 月, 份, 大会, 上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, 1, 月份, 大, 会上, 说, 的, 确实, 在理, 末##末, <br />
<br />
<br />
<span style="color: #008000">//==== The N results after applying the number and date merging strategies: </span><br />
<br />
始##始, 他, 在, <font color="#ff0000">1月份</font>, 大会, 上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, <font color="#ff0000">1月份</font>, 大会, 上, 说, 的, 确实, 在理, 末##末, <br />
始##始, 他, 在, <font color="#ff0000">1月份</font>, 大, 会上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, <font color="#ff0000">1月</font>, 份, 大会, 上, 说, 的, 确实, 在, 理, 末##末, <br />
始##始, 他, 在, <font color="#ff0000">1月份</font>, 大, 会上, 说, 的, 确实, 在理, 末##末, <br />
</div>
</div>
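<p>The nPOS values in the dumps above become readable once decoded. The original source comments state that 27904 is <code>'m'*256</code>, 29696 is <code>'t'*256</code> and 30464 is <code>'w'*256</code>; by the same arithmetic 28160 is <code>'n'*256</code>, 25600 is <code>'d'*256</code> and 24832 is <code>'a'*256</code>. The sketch below decodes such a value. It assumes, without confirmation from the text above, that an optional second tag letter sits in the low byte, and it simply takes the absolute value of the negative codes that appear on placeholder words such as 未##数. <code>DecodePos</code> is a hypothetical helper.</p>

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Decode an nPOS value as printed in the dumps above into its tag letters.
// Assumption: first letter in the high byte, optional second letter in the
// low byte. Values below 256 (0, or the 1 and 4 seen on the sentence
// sentinels) are not letter codes, so an empty string is returned for them.
std::string DecodePos(int nPOS) {
  int code = std::abs(nPOS);
  assert(code < 65536);  // two bytes at most
  if (code < 256) return "";
  std::string tag;
  tag += static_cast<char>(code / 256);
  int low = code % 256;
  if (low != 0) tag += static_cast<char>(low);
  return tag;
}
```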
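<p>All of this was covered in earlier articles. Here I mainly discuss two parts of SharpICTCLAS: atom segmentation and the number/date merging strategy.</p>
<h3>2. Atom segmentation</h3>
<p>Atom segmentation looks like it should be the simplest part of the program, since it does little more than split the text into individual Chinese characters, yet it is also the part most worth improving. SharpICTCLAS currently still uses the original ICTCLAS algorithm with minor adjustments. I am not really satisfied with this approach, however. Given the chance, one could use a set of regular expressions to pull certain &#8220;atomic&#8221; words out in advance: sexagenary-cycle year names such as &#8220;甲子&#8221; and &#8220;乙亥&#8221; are atomic, and URLs, e-mail addresses and the like could also be recognized as atoms up front, which would greatly simplify the later stages. This is something to consider for a future version.</p>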
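<p>A rough idea of what such regex-based pre-atomization might look like is sketched below in C++. This is not part of SharpICTCLAS: the function name <code>PreAtomize</code> and both patterns are hypothetical and deliberately narrow (ASCII-only URL and e-mail shapes), and the input is assumed to be UTF-8. Whatever the patterns do not claim is split into one atom per character.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 character that starts with `lead`.
static std::size_t Utf8Len(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead >> 5) == 0x6) return 2;
  if ((lead >> 4) == 0xE) return 3;
  if ((lead >> 3) == 0x1E) return 4;
  return 1;  // invalid lead byte: treat it as a single-byte atom
}

// Split the byte range [begin, end) of `s` into one atom per UTF-8 character.
static void SplitChars(const std::string& s, std::size_t begin, std::size_t end,
                       std::vector<std::string>* atoms) {
  while (begin < end) {
    std::size_t n = Utf8Len(static_cast<unsigned char>(s[begin]));
    if (begin + n > end) n = end - begin;
    atoms->push_back(s.substr(begin, n));
    begin += n;
  }
}

// Pull URLs and e-mail addresses out as whole atoms first, then fall back to
// character-by-character atoms for everything in between.
std::vector<std::string> PreAtomize(const std::string& utf8) {
  // Hypothetical patterns: URL first, then e-mail address.
  static const std::regex kAtom(
      R"(https?://[A-Za-z0-9./?=&_%#:-]+|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})");
  std::vector<std::string> atoms;
  std::size_t pos = 0;
  for (std::sregex_iterator it(utf8.begin(), utf8.end(), kAtom), last;
       it != last; ++it) {
    std::size_t start = static_cast<std::size_t>(it->position());
    SplitChars(utf8, pos, start, &atoms);  // text before the match
    atoms.push_back(it->str());            // the match itself is one atom
    pos = start + static_cast<std::size_t>(it->length());
  }
  SplitChars(utf8, pos, utf8.size(), &atoms);  // trailing text
  return atoms;
}
```

<p>Fixed vocabulary such as the sexagenary-cycle names could be handled the same way with a lookup table applied before the per-character split.</p>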
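<h3>3. Processing the results</h3>
<p>Both ICTCLAS and SharpICTCLAS compute shortest paths with NShortPath and output the results as arrays. These arrays only record the positions of the segmentation points, so further processing is needed to turn them into actual &#8220;word&#8221; results.</p>
<p>The original ICTCLAS implementation is as follows:</p>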
<div class="code">
<div class="title">
<div style="clear: none">How ICTCLAS processes the NShortPath results</div>
</div>
<div class="content"><span style="color: #0000ff">while</span> (i &lt; m_nSegmentCount) <br />
{ <br />
&nbsp; BiPath2UniPath(nSegRoute[i]);&nbsp; <span style="color: #008000">//Convert the BiPath to a UniPath </span><br />
&nbsp; GenerateWord(nSegRoute, i);&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Generate words according to the segmentation route </span><br />
&nbsp; i++; <br />
}</div>
</div>
<p>What the BiPath2UniPath method does can be illustrated with the following example:</p>
<div class="code">
<div class="content">Converting a BiPath to a UniPath <br />
Example: &#8220;他说的确实在理&#8221; <br />
<br />
BiPath: (0, 1, 2, 3, 6, 9, 11, 12) <br />
&nbsp;&nbsp; 0&nbsp;&nbsp; 1&nbsp;&nbsp; 2&nbsp;&nbsp; 3&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp; 5&nbsp;&nbsp; 6&nbsp;&nbsp;&nbsp; 7&nbsp;&nbsp; 8&nbsp;&nbsp;&nbsp; 9&nbsp;&nbsp; 10&nbsp;&nbsp; 11&nbsp; 12 <br />
始##始&nbsp; 他&nbsp; 说&nbsp; 的&nbsp; 的确&nbsp; 确&nbsp; 确实&nbsp; 实&nbsp; 实在&nbsp; 在&nbsp; 在理&nbsp; 理&nbsp; 末##末 <br />
<br />
After conversion <br />
UniPath: (0, 1, 2, 3, 4, 6, 7, 8) <br />
&nbsp;&nbsp; 0&nbsp;&nbsp; 1&nbsp;&nbsp; 2&nbsp;&nbsp; 3&nbsp; 4&nbsp;&nbsp; 5&nbsp;&nbsp; 6&nbsp;&nbsp; 7&nbsp;&nbsp; 8 <br />
始##始&nbsp; 他&nbsp; 说&nbsp; 的&nbsp; 确&nbsp; 实&nbsp; 在&nbsp; 理&nbsp; 末##末 <br />
</div>
</div>
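<p>As this shows, the UniPath records the split positions in terms of the atom segmentation. The subsequent GenerateWord method then works on this array to perform merging and optimization.</p>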
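<p>The conversion in the example can be reproduced with a few lines of C++. The sketch below is an illustration under stated assumptions rather than the ICTCLAS code: it assumes every word candidate is known by the atom range <code>[row, col)</code> it covers, in the same order as the numbered candidates in the example, and the names <code>WordNode</code> and <code>BiPathToUniPath</code> are hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One word candidate from segGraph: it covers the atoms [row, col).
struct WordNode {
  int row;
  int col;
};

// Map every BiPath entry (an index into the candidate list) to the atom
// position at which that candidate starts.
std::vector<int> BiPathToUniPath(const std::vector<int>& biPath,
                                 const std::vector<WordNode>& nodes) {
  std::vector<int> uniPath;
  uniPath.reserve(biPath.size());
  for (int index : biPath) {
    assert(index >= 0 && static_cast<std::size_t>(index) < nodes.size());
    uniPath.push_back(nodes[index].row);
  }
  return uniPath;
}
```

<p>For the thirteen candidates of &#8220;他说的确实在理&#8221; (始##始 covering atom 0, 他 atom 1, and so on up to 末##末 at atom 8), the BiPath (0, 1, 2, 3, 6, 9, 11, 12) maps to the UniPath (0, 1, 2, 3, 4, 6, 7, 8) given in the example.</p>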
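<p>While reworking this part for SharpICTCLAS, I found that the array representation makes the later steps much harder (consider: is it easier to merge two adjacent nodes in a linked list, or two adjacent elements in an array?). I therefore decided that SharpICTCLAS would convert the BiPath into a linked-list structure for the later steps, and in practice this simplified the work considerably.</p>
<p>This shows up in the BiSegment method:</p>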
<p><code>linkedArray = BiPath2LinkedArray(spResult[i], segGraph, atomSegment); </code></p>
<p>With this change, the<code> int *m_npWordPosMapTable; </code>member of the original ICTCLAS is no longer needed, and the code related to it can be deleted as well.</p>
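<p>The advantage of a linked structure for merging can be illustrated with <code>std::list</code> in C++. This is a sketch of the general idea only; the <code>Word</code> struct and <code>MergeWithNext</code> are hypothetical and are not the WordLinkedArray type used by SharpICTCLAS.</p>

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

struct Word {
  std::string text;
  int pos;  // POS code, 0 when unknown
};

// Merge the node at `it` with its successor: concatenate the text, assign the
// POS code of the merged word and unlink the successor. No other element is
// moved or re-indexed, which is what an array would require.
std::list<Word>::iterator MergeWithNext(std::list<Word>& words,
                                        std::list<Word>::iterator it,
                                        int mergedPos) {
  assert(it != words.end());
  std::list<Word>::iterator next = std::next(it);
  assert(next != words.end());
  it->text += next->text;
  it->pos = mergedPos;
  words.erase(next);  // iterators to the remaining nodes stay valid
  return it;
}
```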
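<h3>4. Date and number merging strategy</h3>
<p>The merging and splitting strategies for numbers, dates and the like are implemented in the GenerateWord method. In the original ICTCLAS this is an enormous method with six or seven levels of nested if and while statements, which makes its inner workings extremely hard to analyze. After studying it, I extracted the main functional parts and reorganized them as a &#8220;pipeline&#8221;, which reduces the code complexity. For the date/time recognition logic, whose structure is especially convoluted, SharpICTCLAS still keeps most of the original code.</p>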
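<p>To give a feel for what a pipeline organization could look like, here is a small C++ sketch with two passes over a linked list of tokens. It is an assumption-laden illustration, not the SharpICTCLAS implementation: the pass names, the <code>Token</code> struct and the restriction to ASCII digits are all hypothetical. The suffix list (月, 日, 时, 分, 秒, 月份) and the POS codes for numbers and time words are taken from the original GenerateWord code; the separate handling of 年 is left out.</p>

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <list>
#include <string>
#include <vector>

struct Token {
  std::string text;
  int pos;
};

using Pass = std::function<void(std::list<Token>&)>;

const int kNum = -27904;   // 'm'*256 negated, as used for 未##数
const int kTime = -29696;  // 't'*256 negated, as used for 未##时

static bool IsAsciiNumber(const std::string& s) {
  if (s.empty()) return false;
  for (char c : s) {
    if (c < '0' || c > '9') return false;
  }
  return true;
}

// Pass 1: merge each run of adjacent ASCII-digit tokens into one number token.
void MergeNumbers(std::list<Token>& tokens) {
  for (auto it = tokens.begin(); it != tokens.end(); ++it) {
    if (!IsAsciiNumber(it->text)) continue;
    it->pos = kNum;
    auto next = std::next(it);
    while (next != tokens.end() && IsAsciiNumber(next->text)) {
      it->text += next->text;
      next = tokens.erase(next);
    }
  }
}

// Pass 2: a number followed by a time suffix becomes a single time word.
void AttachTimeSuffix(std::list<Token>& tokens) {
  static const std::vector<std::string> kSuffixes = {"月", "日", "时",
                                                     "分", "秒", "月份"};
  for (auto it = tokens.begin(); it != tokens.end(); ++it) {
    auto next = std::next(it);
    if (it->pos != kNum || next == tokens.end()) continue;
    for (const std::string& suffix : kSuffixes) {
      if (next->text == suffix) {
        it->text += next->text;
        it->pos = kTime;
        tokens.erase(next);
        break;
      }
    }
  }
}

// Run the passes in order; each pass sees the output of the previous one.
void RunPipeline(std::list<Token>& tokens, const std::vector<Pass>& passes) {
  for (const Pass& pass : passes) pass(tokens);
}
```

<p>Each rule stays small and can be tested on its own, and adding a rule means adding one more pass to the list instead of another level of nesting.</p>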
<p>Let us first look at the GenerateWord method of the original ICTCLAS (a very long method):</p>
<div class="code">
<div class="title">
<div style="clear: none">The GenerateWord method in ICTCLAS</div>
</div>
<div class="content"><span style="color: #008000">//Generate Word according the segmentation route </span><br />
<span style="color: #0000ff">bool</span> CSegment::GenerateWord(<span style="color: #0000ff">int</span> **nSegRoute, <span style="color: #0000ff">int</span> nIndex) <br />
{ <br />
&nbsp; unsigned <span style="color: #0000ff">int</span> i = 0, k = 0; <br />
&nbsp; <span style="color: #0000ff">int</span> j, nStartVertex, nEndVertex, nPOS; <br />
&nbsp; <span style="color: #0000ff">char</span> sAtom[WORD_MAXLENGTH], sNumCandidate[100], sCurWord[100]; <br />
&nbsp; ELEMENT_TYPE fValue; <br />
&nbsp; <span style="color: #0000ff">while</span> (nSegRoute[nIndex][i] !=&nbsp; - 1 &amp;&amp; nSegRoute[nIndex][i + 1] !=&nbsp; - 1 &amp;&amp; <br />
&nbsp;&nbsp;&nbsp; nSegRoute[nIndex][i] &lt; nSegRoute[nIndex][i + 1]) <br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; nStartVertex = nSegRoute[nIndex][i]; <br />
&nbsp;&nbsp;&nbsp; j = nStartVertex; <span style="color: #008000">//Set the start vertex </span><br />
&nbsp;&nbsp;&nbsp; nEndVertex = nSegRoute[nIndex][i + 1]; <span style="color: #008000">//Set the end vertex </span><br />
&nbsp;&nbsp;&nbsp; nPOS = 0; <br />
&nbsp;&nbsp;&nbsp; m_graphSeg.m_segGraph.GetElement(nStartVertex, nEndVertex, &amp;fValue, &amp;nPOS); <br />
&nbsp;&nbsp;&nbsp; sAtom[0] = 0; <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (j &lt; nEndVertex) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Generate the word according the segmentation route </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(sAtom, m_graphSeg.m_sAtom[j]); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; j++; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[0] = 0; <span style="color: #008000">//Init the result ending </span><br />
&nbsp;&nbsp;&nbsp; strcpy(sNumCandidate, sAtom); <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (sAtom[0] != 0 &amp;&amp; (IsAllNum((unsigned <span style="color: #0000ff">char</span>*)sNumCandidate) || <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; IsAllChineseNum(sNumCandidate))) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Merge all separate consecutive numbers into one number </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//sAtom[0]!=0: add in 2002-5-9 </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(m_pWordSeg[nIndex][k].sWord, sNumCandidate); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Save them in the result segmentation </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i++; <span style="color: #008000">//Skip to next atom now&nbsp; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sAtom[0] = 0; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (j &lt; nSegRoute[nIndex][i + 1]) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Generate the word according the segmentation route </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(sAtom, m_graphSeg.m_sAtom[j]); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; j++; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(sNumCandidate, sAtom); <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; unsigned <span style="color: #0000ff">int</span> nLen = strlen(m_pWordSeg[nIndex][k].sWord); <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nLen == 4 &amp;&amp; CC_Find(<span style="color: #ff00ff">"第上成&#177;—＋∶&#183;．／"</span>, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord) || nLen == 1 &amp;&amp; strchr(<span style="color: #ff00ff">"+-./"</span>, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[0])) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Only one word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, m_pWordSeg[nIndex][k].sWord); <span style="color: #008000">//Record current word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <span style="color: #0000ff">if</span> (m_pWordSeg[nIndex][k].sWord[0] == 0) <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The while loop above was never entered </span><br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(m_pWordSeg[nIndex][k].sWord, sAtom); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Save them in the result segmentation </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, sAtom); <span style="color: #008000">//Record current word </span><br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//It is a num </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (strcmp(<span style="color: #ff00ff">"－－"</span>, m_pWordSeg[nIndex][k].sWord) == 0 || strcmp(<span style="color: #ff00ff">"—"</span>, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord) == 0 || m_pWordSeg[nIndex][k].sWord[0] == <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '-' &amp;&amp; m_pWordSeg[nIndex][k].sWord[1] == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//The delimiter "－－" </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS = 30464; <span style="color: #008000">//'w'*256;Set the POS with 'w' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <span style="color: #008000">//Not num, back to previous word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Adding time suffix </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">char</span> sInitChar[3]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; unsigned <span style="color: #0000ff">int</span> nCharIndex = 0; <span style="color: #008000">//Get first char </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sInitChar[nCharIndex] = m_pWordSeg[nIndex][k].sWord[nCharIndex]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (sInitChar[nCharIndex] &lt; 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCharIndex += 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sInitChar[nCharIndex] = m_pWordSeg[nIndex][k].sWord[nCharIndex]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCharIndex += 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sInitChar[nCharIndex] = '\0'; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (k &gt; 0 &amp;&amp; (abs(m_pWordSeg[nIndex][k - 1].nHandle) == 27904 || abs <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (m_pWordSeg[nIndex][k - 1].nHandle) == 29696) &amp;&amp; (strcmp(sInitChar,&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #ff00ff">"—"</span>) == 0 || sInitChar[0] == '-') &amp;&amp; (strlen <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (m_pWordSeg[nIndex][k].sWord) &gt; nCharIndex)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//3-4月&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //27904='m'*256 </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Split the sInitChar from the original word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(m_pWordSeg[nIndex][k + 1].sWord, m_pWordSeg[nIndex][k].sWord + <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCharIndex); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k + 1].dValue = m_pWordSeg[nIndex][k].dValue; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k + 1].nHandle = 27904; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nCharIndex] = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].dValue = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].nHandle = 30464; <span style="color: #008000">//'w'*256; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_graphOptimum.SetElement(nStartVertex, nStartVertex + 1, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].dValue, m_pWordSeg[nIndex][k].nHandle, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nStartVertex += 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; k += 1; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nLen = strlen(m_pWordSeg[nIndex][k].sWord); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> ((strlen(sAtom) == 2 &amp;&amp; CC_Find(<span style="color: #ff00ff">"月日时分秒"</span>, sAtom)) || strcmp <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (sAtom, <span style="color: #ff00ff">"月份"</span>) == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//2001年 </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(m_pWordSeg[nIndex][k].sWord, sAtom); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##时"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 29696; <span style="color: #008000">//'t'*256; Set the POS with 't' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <span style="color: #0000ff">if</span> (strcmp(sAtom, <span style="color: #ff00ff">"年"</span>) == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (IsYearTime(m_pWordSeg[nIndex][k].sWord)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//strncmp(sAtom,"年",2)==0&amp;&amp; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//1998年， </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcat(m_pWordSeg[nIndex][k].sWord, sAtom); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##时"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 29696; <span style="color: #008000">//Set the POS with 't' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##数"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 27904; <span style="color: #008000">//Set the POS with 'm' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <span style="color: #008000">//Can not be a time word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//早晨/t&nbsp; 五点/t&nbsp; </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (strcmp(m_pWordSeg[nIndex][k].sWord + strlen <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (m_pWordSeg[nIndex][k].sWord) - 2, <span style="color: #ff00ff">"点"</span>) == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##时"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 29696; <span style="color: #008000">//Set the POS with 't' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (!CC_Find(<span style="color: #ff00ff">"∶&#183;．／"</span>, m_pWordSeg[nIndex][k].sWord + nLen - 2) &amp;&amp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 1] != '.' &amp;&amp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 1] != '/') <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##数"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 27904; <span style="color: #008000">//'m'*256;Set the POS with 'm' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <span style="color: #0000ff">if</span> (nLen &gt; strlen(sInitChar)) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Get rid of . example 1. </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (m_pWordSeg[nIndex][k].sWord[nLen - 1] == '.' || <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 1] == '/') <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 1] = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].sWord[nLen - 2] = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; strcpy(sCurWord, <span style="color: #ff00ff">"未##数"</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPOS =&nbsp; - 27904; <span style="color: #008000">//'m'*256;Set the POS with 'm' </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i--; <span style="color: #008000">//Not num, back to previous word </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fValue = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nEndVertex = nSegRoute[nIndex][i + 1]; <span style="color: #008000">//Ending POS changed to latter </span><br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].nHandle = nPOS; <span style="color: #008000">//Get the POS of current word </span><br />
&nbsp;&nbsp;&nbsp; m_pWordSeg[nIndex][k].dValue = fValue;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//(int)(MAX_FREQUENCE*exp(-fValue));//Return the frequency of current word </span><br />
&nbsp;&nbsp;&nbsp; m_graphOptimum.SetElement(nStartVertex, nEndVertex, fValue, nPOS, sCurWord); <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Generate optimum segmentation graph according the segmentation result </span><br />
&nbsp;&nbsp;&nbsp; i++; <span style="color: #008000">//Skip to next atom </span><br />
&nbsp;&nbsp;&nbsp; k++; <span style="color: #008000">//Accept next word </span><br />
&nbsp; } <br />
&nbsp; m_pWordSeg[nIndex][k].sWord[0] = 0; <br />
&nbsp; m_pWordSeg[nIndex][k].nHandle =&nbsp; - 1; <span style="color: #008000">//Set ending </span><br />
&nbsp; <span style="color: #0000ff">return</span> <span style="color: #0000ff">true</span>; <br />
}</div>
</div>
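<p>SharpICTCLAS splits this overlong function into separate responsibilities and adopts a pipeline-style flow: each stage handles one task and hands its result on to the next (much like the Chain of Responsibility design pattern), which makes the overall structure much clearer. The GenerateWord method in SharpICTCLAS is defined as follows:</p>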
<div class="code">
<div class="title">
<div style="clear: none">The GenerateWord method in SharpICTCLAS</div>
</div>
<div class="content"><span style="color: #0000ff">private</span> <span style="color: #0000ff">static</span> WordResult[] GenerateWord(<span style="color: #0000ff">int</span>[] uniPath, WordLinkedArray linkedArray,&nbsp; <br />
&nbsp;&nbsp; RowFirstDynamicArray&lt;ChainContent&gt; m_graphOptimum) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">if</span> (linkedArray.Count == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> <span style="color: #0000ff">null</span>; <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//Merge all separate consecutive numbers into one number </span><br />
&nbsp;&nbsp; MergeContinueNumIntoOne(<span style="color: #0000ff">ref</span> linkedArray); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//The delimiter "－－" </span><br />
&nbsp;&nbsp; ChangeDelimiterPOS(<span style="color: #0000ff">ref</span> linkedArray); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//If the previous word is a number and the current word starts with &#8220;－&#8221; or &#8220;-&#8221; and is longer than that one character, </span><br />
&nbsp;&nbsp; <span style="color: #008000">//split the &#8220;－&#8221; off from the current word. </span><br />
&nbsp;&nbsp; <span style="color: #008000">//For example, &#8220;3 / -4 / 月&#8221; must be split into &#8220;3 / - / 4 / 月&#8221; </span><br />
&nbsp;&nbsp; SplitMiddleSlashFromDigitalWords(<span style="color: #0000ff">ref</span> linkedArray); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
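&nbsp;&nbsp; <span style="color: #008000">//1. If the current word is a number and the next word is one of &#8220;月、日、时、分、秒、月份&#8221;, merge them and mark the current word as a time word </span><br />
&nbsp;&nbsp; <span style="color: #008000">//2. If the current word is a number that can be a year and the next word is &#8220;年&#8221;, merge them and mark the result as time; otherwise it is a number. </span><br />
&nbsp;&nbsp; <span style="color: #008000">//3. If the last Chinese character is "点", treat the current number as a time </span><br />
&nbsp;&nbsp; <span style="color: #008000">//4. If the last character of the current string is neither one of "∶&#183;．／" nor a half-width '.' or '/', it is a number </span><br />
&nbsp;&nbsp; <span style="color: #008000">//5. If the last character of the current string is one of "∶&#183;．／" or a half-width '.' or '/', and the length is greater than 1, drop the last character. Example: "1." </span><br />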
&nbsp;&nbsp; <font color="#ff0000">CheckDateElements</font>(<span style="color: #0000ff">ref</span> linkedArray); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">//-------------------------------------------------------------------- </span><br />
&nbsp;&nbsp; <span style="color: #008000">//Traverse the linked list and output the result </span><br />
&nbsp;&nbsp; WordResult[] result = <span style="color: #0000ff">new</span> WordResult[linkedArray.Count]; <br />
<br />
&nbsp;&nbsp; WordNode pCur = linkedArray.first; <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> i = 0; <br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pCur != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WordResult item = <span style="color: #0000ff">new</span> WordResult(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; item.sWord = pCur.theWord.sWord; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; item.nPOS = pCur.theWord.nPOS; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; item.dValue = pCur.theWord.dValue; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; result[i] = item; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_graphOptimum.SetElement(pCur.row, pCur.col, <span style="color: #0000ff">new</span> ChainContent(item.dValue, item.nPOS, pCur.sWordInSegGraph)); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pCur = pCur.next; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i++; <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> result; <br />
}</div>
</div>
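<p>As the code shows, linkedArray is handed like a baton from one processing stage to the next, each stage working on it in turn, until the final result is produced. The CheckDateElements method still covers too many cases, however, so its implementation remains somewhat bloated and could be split up further in the future.</p>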
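<p>The pipeline idea can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions, not the SharpICTCLAS code: <code>Word</code>, <code>Stage</code>, <code>mergeNumbers</code>, <code>tagDates</code> and <code>runPipeline</code> are hypothetical names, a plain vector stands in for WordLinkedArray, and only a small part of the date rules is covered.</p>

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one node of WordLinkedArray.
struct Word {
    std::string text;
    char pos;  // 'm' = number, 't' = time, 'x' = anything else
};

// A stage rewrites the word list in place and hands it on to the next stage.
using Stage = std::function<void(std::vector<Word>&)>;

static bool isNumber(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s)
        if (!std::isdigit(c)) return false;
    return true;
}

// Stage 1: merge runs of adjacent numeric tokens into one number.
void mergeNumbers(std::vector<Word>& words) {
    std::vector<Word> out;
    for (const Word& w : words) {
        if (!out.empty() && isNumber(out.back().text) && isNumber(w.text))
            out.back().text += w.text;
        else
            out.push_back(w);
    }
    for (Word& w : out)
        if (isNumber(w.text)) w.pos = 'm';
    words = out;
}

// Stage 2: a number followed by a date unit becomes a single time word.
// Only two units are checked here to keep the sketch short.
void tagDates(std::vector<Word>& words) {
    std::vector<Word> out;
    for (const Word& w : words) {
        const bool unit = (w.text == "月" || w.text == "日");
        if (unit && !out.empty() && out.back().pos == 'm') {
            out.back().text += w.text;
            out.back().pos = 't';
        } else {
            out.push_back(w);
        }
    }
    words = out;
}

// Run the stages in order; each one receives the previous stage's output.
std::vector<Word> runPipeline(std::vector<Word> words,
                              const std::vector<Stage>& stages) {
    for (const Stage& stage : stages) stage(words);
    return words;
}
```

<p>Calling <code>runPipeline</code> with the stages <code>mergeNumbers</code> and <code>tagDates</code> on the tokens "3", "1", "月" should first merge the digits and then attach the unit. Adding a rule means adding one more stage rather than growing a single long function.</p>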
<p>　</p>
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
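<p>1) Segment is the largest class in SharpICTCLAS and implements several key steps of the segmentation process.</p>
<p>2) The Segment class changes a great deal of the original ICTCLAS code, aiming to simplify the original operations through new data structures.</p>
<p>3) Segment defines some of its methods as static to make them cheaper to call.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>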
 <img src ="http://www.blogjava.net/jiangyz/aggbug/171311.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:18 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171311.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (5): NShortPath-2 (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171309.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 12:05:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171309.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171309.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171309.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171309.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171309.html</trackback:ping><description><![CDATA[<p>Now that we have seen how the 1-shortest path is computed, let us look at the N-shortest path.</p>
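<p>The N-shortest path is computed in essentially the same way as the 1-shortest path, except that when recording the reachable paths we keep the N shortest path lengths instead of only the shortest one. We reuse the example from the previous article to show how this is done.</p>
<h3>1. Data Representation</h3>
<p>We again use the earlier example and compute the N-shortest paths of the graph below; the weight of each edge is marked in the figure.</p>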
<p><img height="107" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308002.gif" width="383" border="0" /></p>
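<p>(Figure 1)</p>
<h3>2. Computation</h3>
<p>As with the 1-shortest path, we compute for every node the PreNodes that lie on its N-shortest paths. We take the 2-shortest path as the example:</p>
<p>1) First compute the lengths of all paths that can reach each node and sort them in ascending order.</p>
<p>2) Take the two smallest path lengths from the sorted result and record each of them in the node's PreNode queues, as shown below:</p>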
<p><img height="278" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308008.gif" width="456" border="0" /></p>
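<p>(Figure 2)</p>
<p>In this figure, nodes 1, 2 and 3 may be reached by more than one path, but each of them has only one possible path length. Node 4 (&#8220;D&#8221;), however, can be reached with two different path lengths, 3 or 4. The PreNode for length 3 is therefore recorded in the &#8220;shortest&#8221; slot (index = 0) and the PreNode for length 4 in the &#8220;second-shortest&#8221; slot (index = 1), and so on.</p>
<p>Note that the coordinate used to record a PreNode has changed from a single number in the &#8220;1-shortest path&#8221; case (<font color="#0000ff">the ParentNode value</font>) to a pair of numbers (<font color="#0000ff">the ParentNode value and the index value</font>).</p>
<p>As Figure 2 shows, the second-shortest path to node 6 (&#8220;末&#8221;, the end node) has two ParentNodes: node 4 at index = 0 and node 5 at index = 1. Both give a total path length of 6.</p>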
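<p>The bookkeeping described above can be sketched in C++ as follows. This is a minimal illustration rather than the SharpICTCLAS implementation: <code>Edge</code>, <code>Parent</code>, <code>Slot</code>, <code>nShortest</code> and <code>exampleEdges</code> are hypothetical names, a plain vector replaces the dynamic array and CQueue, and the sketch assumes every edge runs from a lower-numbered node to a higher-numbered one, as in the example graph.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Edge { int row, col; double weight; };   // edge from node row to node col
struct Parent { int node, index; };             // PreNode coordinate: (ParentNode, index)
struct Slot { double length; std::vector<Parent> parents; };

// slots[v][k] holds the (k+1)-th smallest distinct path length from node 0
// to node v, together with every PreNode coordinate that produces it.
std::vector<std::vector<Slot>> nShortest(int nodeCount,
                                         const std::vector<Edge>& edges, int n) {
    struct Candidate { double length; int node, index; };
    std::vector<std::vector<Slot>> slots(nodeCount);
    slots[0].push_back({0.0, {}});              // the start node: length 0, no PreNode
    for (int cur = 1; cur < nodeCount; ++cur) {
        // Collect one candidate per incoming edge and per slot of its PreNode.
        std::vector<Candidate> cands;
        for (const Edge& e : edges) {
            if (e.col != cur) continue;
            for (int k = 0; k < static_cast<int>(slots[e.row].size()); ++k)
                cands.push_back({slots[e.row][k].length + e.weight, e.row, k});
        }
        // Sort by total length (ties by node, then index, to be deterministic).
        std::sort(cands.begin(), cands.end(),
                  [](const Candidate& a, const Candidate& b) {
                      if (a.length != b.length) return a.length < b.length;
                      if (a.node != b.node) return a.node < b.node;
                      return a.index < b.index;
                  });
        // Keep only the n smallest distinct lengths.
        for (const Candidate& c : cands) {
            if (slots[cur].empty() || slots[cur].back().length != c.length) {
                if (static_cast<int>(slots[cur].size()) == n) break;
                slots[cur].push_back({c.length, {}});
            }
            slots[cur].back().parents.push_back({c.node, c.index});
        }
    }
    return slots;
}

// The ten edges of the example graph.
std::vector<Edge> exampleEdges() {
    return {{0, 1, 1}, {1, 2, 1}, {1, 3, 2}, {2, 3, 1}, {2, 4, 1},
            {3, 4, 1}, {4, 5, 1}, {3, 6, 2}, {4, 6, 3}, {5, 6, 1}};
}
```

<p>With <code>n = 2</code>, the slots of node 6 should carry the lengths 5 and 6, and the second slot should list the two coordinates named in the text: node 4 at index 0 and node 5 at index 1. The edge weights here are small integers stored as doubles, so comparing lengths for equality is exact; real probabilistic weights would need a tolerance.</p>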
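<h3>3. Implementation</h3>
<p>To implement this algorithm we first need the lengths of all candidate paths. SharpICTCLAS obtains them in the EnQueueCurNodeEdges method. The previous article showed a simplified version of it; here is the complete version used for the N-shortest path:</p>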
<div class="code">
<div class="title">
<div style="clear: none">Program</div>
</div>
<div class="content"><span style="color: #008000">//==================================================================== </span><br />
<span style="color: #008000">// Sort all candidate edges into the current node (nCurNode) by eWeight and enqueue them </span><br />
<span style="color: #008000">//==================================================================== </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">static</span> <span style="color: #0000ff">void</span> EnQueueCurNodeEdges(<span style="color: #0000ff">ref</span> CQueue queWork, <span style="color: #0000ff">int</span> nCurNode) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> nPreNode; <br />
&nbsp;&nbsp; <span style="color: #0000ff">double</span> eWeight; <br />
&nbsp;&nbsp; ChainItem&lt;ChainContent&gt; pEdgeList; <br />
<br />
&nbsp;&nbsp; queWork.Clear(); <br />
&nbsp;&nbsp; pEdgeList = m_apCost.GetFirstElementOfCol(nCurNode); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">// Get all the edges </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pEdgeList != <span style="color: #0000ff">null</span> &amp;&amp; pEdgeList.col == nCurNode) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPreNode = pEdgeList.row;&nbsp; <span style="color: #008000">// </span><font color="#ff0000">A rather special statement: it exploits the relationship between row and col</font><span style="color: #008000"> </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; eWeight = pEdgeList.Content.eWeight; <span style="color: #008000">//Get the eWeight of edges </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span> (<span style="color: #0000ff">int</span> i = 0; i &lt; <font color="#ff0000">m_nValueKind</font>; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// The first node has no PreNode, so enqueue it directly </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nPreNode == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queWork.EnQueue(<span style="color: #0000ff">new</span> QueueElement(nPreNode, i, eWeight)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// If the PreNode's weight == Predefine.INFINITE_VALUE there is no point in going on </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (m_pWeight[nPreNode - 1][i] == Predefine.INFINITE_VALUE) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">break</span>; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queWork.EnQueue(<span style="color: #0000ff">new</span> QueueElement(nPreNode, i, eWeight + m_pWeight[nPreNode - 1][i])); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pEdgeList = pEdgeList.next; <br />
&nbsp;&nbsp; } <br />
}</div>
</div>
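<p>Here m_nValueKind is the number of distinct path lengths that the N-shortest path computation should keep.</p>
<p>With m_nValueKind = 2 we obtain the 2-shortest paths. There are two path lengths, 5 and 6, and six paths in total:</p>
<p>Shortest paths:</p>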
<ul>
    <li><font color="#0000ff">0, 1, 3, 6,</font>
    <li><font color="#0000ff">0, 1, 2, 3, 6,</font>
    <li><font color="#0000ff">0, 1, 2, 4, 5, 6,</font> </li>
</ul>
<p>========================</p>
<p>Second-shortest paths:</p>
<ul>
    <li><font color="#0000ff">0, 1, 2, 4, 6,</font>
    <li><font color="#0000ff">0, 1, 3, 4, 5, 6,</font>
    <li><font color="#0000ff">0, 1, 2, 3, 4, 5, 6,</font> </li>
</ul>
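<h3>4. Producing the N-Shortest Paths</h3>
<p>The final output of the N-shortest paths works exactly as in the previous article and again relies on a stack. The only difference is that the push and pop passes are run once for each index value. The details are not repeated here; see the previous article if you are interested.</p>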
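<p>As a rough sketch of that output step, the C++ below walks the (ParentNode, index) coordinates with an explicit stack. It is an illustration under assumptions, not the SharpICTCLAS GetPaths method: <code>Record</code>, <code>Frame</code>, <code>parentsOf</code>, <code>pathsOfRank</code> and <code>exampleRecords</code> are hypothetical names, and the PreNode table for the 2-shortest example was worked out by hand from the edge weights.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One PreNode entry: slot (node, index) can be reached from (parentNode, parentIndex).
struct Record { int node, index, parentNode, parentIndex; };

// One stack entry: a slot plus the cursor into that slot's PreNode list.
struct Frame { int node, index; std::size_t cursor; };

// All PreNode entries recorded for slot (node, index), in table order.
std::vector<Record> parentsOf(const std::vector<Record>& all, int node, int index) {
    std::vector<Record> out;
    for (const Record& r : all)
        if (r.node == node && r.index == index) out.push_back(r);
    return out;
}

// Every path from node 0 to `last` that belongs to rank `index`
// (0 = shortest, 1 = second shortest, ...).
std::vector<std::vector<int>> pathsOfRank(const std::vector<Record>& all,
                                          int last, int index) {
    std::vector<std::vector<int>> result;
    std::vector<Frame> stack;
    stack.push_back({last, index, 0});
    while (!stack.empty()) {
        const Frame top = stack.back();      // copy: push_back may reallocate
        const std::vector<Record> parents = parentsOf(all, top.node, top.index);
        if (top.node != 0 && top.cursor < parents.size()) {
            // Push the current PreNode coordinate; its own cursor starts at 0.
            const Record& p = parents[top.cursor];
            stack.push_back({p.parentNode, p.parentIndex, 0});
            continue;
        }
        if (top.node == 0) {
            // Node 0 reached: the stack, read from top to bottom, is one path.
            std::vector<int> path;
            for (auto it = stack.rbegin(); it != stack.rend(); ++it)
                path.push_back(it->node);
            result.push_back(path);
        }
        // Pop, then advance the cursor of the entry underneath.
        stack.pop_back();
        if (!stack.empty()) ++stack.back().cursor;
    }
    return result;
}

// PreNode table of the 2-shortest example (hand-derived, an assumption of this sketch).
std::vector<Record> exampleRecords() {
    return {{1, 0, 0, 0}, {2, 0, 1, 0}, {3, 0, 1, 0}, {3, 0, 2, 0},
            {4, 0, 2, 0}, {4, 1, 3, 0}, {5, 0, 4, 0}, {5, 1, 4, 1},
            {6, 0, 3, 0}, {6, 0, 5, 0}, {6, 1, 4, 0}, {6, 1, 5, 1}};
}
```

<p>Calling <code>pathsOfRank(exampleRecords(), 6, 1)</code> should reproduce the three second-shortest paths listed above, and passing index 0 the three shortest ones. The only change compared with the 1-shortest case is that each stack entry carries the index next to the node number.</p>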
<p>　</p>
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
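<p>1) In the N-shortest path, the coordinate used to record a PreNode changes from the single number used for the &#8220;1-shortest path&#8221; (<font color="#0000ff">the ParentNode value</font>) to a pair of numbers (<font color="#0000ff">the ParentNode value and the index value</font>).</p>
<p>2) N-shortest path does not mean that only N paths are found.</p>
<p>3) This article only demonstrates the 2-shortest path, but the approach generalizes to N. In the 3-shortest paths computed by the program, the longest paths are (0, 1, 3, 4, 6) and (0, 1, 2, 3, 4, 6), both of length 7.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>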
<img src ="http://www.blogjava.net/jiangyz/aggbug/171309.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 20:05 <a href="http://www.blogjava.net/jiangyz/archive/2007/12/28/171309.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Introduction to the SharpICTCLAS Word Segmentation System (4): NShortPath-1 (Repost)</title><link>http://www.blogjava.net/jiangyz/archive/2007/12/28/171299.html</link><dc:creator>刀剑笑</dc:creator><author>刀剑笑</author><pubDate>Fri, 28 Dec 2007 11:38:00 GMT</pubDate><guid>http://www.blogjava.net/jiangyz/archive/2007/12/28/171299.html</guid><wfw:comment>http://www.blogjava.net/jiangyz/comments/171299.html</wfw:comment><comments>http://www.blogjava.net/jiangyz/archive/2007/12/28/171299.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/jiangyz/comments/commentRss/171299.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/jiangyz/services/trackbacks/171299.html</trackback:ping><description><![CDATA[<p>Rough segmentation of Chinese words by N-shortest paths is a very important step in the segmentation process. It is also, in my opinion, the hardest part of the original ICTCLAS code to read, and there are still some methods I have not figured out, so I rewrote the NShortPath class almost from scratch. Explaining how the N-shortest path code works is not easy, so I have split it into two parts: this part covers how the 1-shortest path is implemented in SharpICTCLAS, and the next article extends it to the N-shortest path.</p>
<h3>1. Data Representation</h3>
<p>The shortest-path example here uses the following directed graph; the weight of each edge is marked in the figure.</p>
<p><img height="107" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308002.gif" width="383" border="0" /></p>
<p>(Figure 1)</p>
<p>As described in the previous article, this graph is equivalent to the following two-dimensional table:</p>
<p><img height="255" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308003.gif" width="372" border="0" /></p>
<p>(Figure 2)</p>
<p>This table corresponds to a ColumnFirstDynamicArray with 10 elements, one for each edge of the graph. The value of each element is listed below:</p>
<div class="code">
<div class="title">
<div style="clear: none">The ColumnFirstDynamicArray for this example</div>
</div>
<div class="content">row:0,&nbsp; col:1,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: 始@A <br />
row:1,&nbsp; col:2,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: A@B <br />
row:1,&nbsp; col:3,&nbsp; eWeight:2,&nbsp; nPOS:0,&nbsp; sWord: A@C <br />
row:2,&nbsp; col:3,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: B@C <br />
row:2,&nbsp; col:4,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: B@D <br />
row:3,&nbsp; col:4,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: C@D <br />
row:4,&nbsp; col:5,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: D@E <br />
row:3,&nbsp; col:6,&nbsp; eWeight:2,&nbsp; nPOS:0,&nbsp; sWord: C@末 <br />
row:4,&nbsp; col:6,&nbsp; eWeight:3,&nbsp; nPOS:0,&nbsp; sWord: D@末 <br />
row:5,&nbsp; col:6,&nbsp; eWeight:1,&nbsp; nPOS:0,&nbsp; sWord: E@末</div>
</div>
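<h3>2. Computing the Shortest-Path PreNodes of Each Node</h3>
<p>Before solving the N-shortest path, let us see how to find the PreNodes of the shortest path, as shown below:</p>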
<p><img height="195" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308004.gif" width="456" border="0" /></p>
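<p>(Figure 3)</p>
<p>First compute the shortest path to each node and push that node's parent nodes into the queue that belongs to it. Take node 3 (&#8220;C&#8221;) as an example: the shortest path to it has length 3, and its parent can be either node 1 (&#8220;A&#8221;) or node 2 (&#8220;B&#8221;), so two PreNodes are stored in its queue.</p>
<p>In practice, how do we know how many shortest paths reach node 3 (&#8220;C&#8221;)? We first compute the lengths of all paths that reach node 3 and sort them in ascending order (the CQueue class takes care of all this). We then read entries from the front of the queue and take every PreNode whose path length equals the shortest one.</p>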
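<p>A minimal C++ sketch of this step is given below. It is not the SharpICTCLAS code: <code>Edge</code>, <code>ShortestInfo</code>, <code>shortestPreNodes</code> and <code>exampleEdges</code> are hypothetical names, a sorted vector of pairs plays the role of CQueue, and the sketch assumes that every edge runs from a lower-numbered node to a higher-numbered one and that every node other than 0 has at least one incoming edge.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Edge { int row, col; double weight; };   // edge from node row to node col

struct ShortestInfo {
    std::vector<double> dist;                // dist[v]: shortest length from node 0 to v
    std::vector<std::vector<int>> preNodes;  // preNodes[v]: every PreNode on a shortest path to v
};

ShortestInfo shortestPreNodes(int nodeCount, const std::vector<Edge>& edges) {
    ShortestInfo info;
    info.dist.resize(nodeCount, 0.0);
    info.preNodes.resize(nodeCount);
    for (int cur = 1; cur < nodeCount; ++cur) {
        // (total path length, PreNode) for every edge that ends at cur.
        std::vector<std::pair<double, int>> queue;
        for (const Edge& e : edges)
            if (e.col == cur)
                queue.push_back({info.dist[e.row] + e.weight, e.row});
        if (queue.empty()) continue;         // no incoming edge: leave the node untouched
        std::sort(queue.begin(), queue.end());  // ascending by length, then by PreNode
        info.dist[cur] = queue.front().first;
        // Read from the front and keep every entry that ties with the shortest length.
        for (const auto& entry : queue) {
            if (entry.first != info.dist[cur]) break;
            info.preNodes[cur].push_back(entry.second);
        }
    }
    return info;
}

// The ten edges of the example graph.
std::vector<Edge> exampleEdges() {
    return {{0, 1, 1}, {1, 2, 1}, {1, 3, 2}, {2, 3, 1}, {2, 4, 1},
            {3, 4, 1}, {4, 5, 1}, {3, 6, 2}, {4, 6, 3}, {5, 6, 1}};
}
```

<p>For the example graph, node 3 should end up with the PreNodes 1 and 2 at length 3, as described above. The weights are small integers stored as doubles, so the equality test on lengths is exact here.</p>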
<p>The code below collects the candidate edges into the current node (nCurNode) and enqueues them in ascending order of total path length (simplified):</p>
<div class="code">
<div class="title">
<div style="clear: none">The EnQueueCurNodeEdges method</div>
</div>
<div class="content"><span style="color: #008000">//==================================================================== </span><br />
<span style="color: #008000">// Sort all candidate edges into the current node (nCurNode) by eWeight and enqueue them </span><br />
<span style="color: #008000">//==================================================================== </span><br />
<span style="color: #0000ff">private</span> <span style="color: #0000ff">void</span> EnQueueCurNodeEdges(<span style="color: #0000ff">ref</span> CQueue queWork, <span style="color: #0000ff">int</span> nCurNode) <br />
{ <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> nPreNode; <br />
&nbsp;&nbsp; <span style="color: #0000ff">double</span> eWeight; <br />
&nbsp;&nbsp; ChainItem&lt;ChainContent&gt; pEdgeList; <br />
<br />
&nbsp;&nbsp; queWork.Clear(); <br />
&nbsp;&nbsp; pEdgeList = m_apCost.GetFirstElementOfCol(nCurNode); <br />
<br />
&nbsp;&nbsp; <span style="color: #008000">// Get all edges that lead into the current node </span><br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (pEdgeList != <span style="color: #0000ff">null</span> &amp;&amp; pEdgeList.col == nCurNode) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nPreNode = pEdgeList.row;&nbsp; <span style="color: #008000">// </span><font color="#ff0000">A rather special statement: it exploits the relationship between row and col</font><span style="color: #008000"> </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; eWeight = pEdgeList.Content.eWeight; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// The start node (0) has no PreNode of its own, so the edge weight is already the path length </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nPreNode == 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queWork.EnQueue(<span style="color: #0000ff">new</span> QueueElement(nPreNode, eWeight)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">else</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queWork.EnQueue(<span style="color: #0000ff">new</span> QueueElement(nPreNode, eWeight + m_pWeight[nPreNode - 1])); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// Always advance, so that the remaining edges into nCurNode are not skipped </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pEdgeList = pEdgeList.next; <br />
&nbsp;&nbsp; } <br />
} <br />
</div>
</div>
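<p>This code contains one rather special statement, the one annotated in red: &#8220;nPreNode = pEdgeList.row;&#8221;. It took me quite a while to work out what the original ICTCLAS intended by it. It relies on Figure 2 of this article, which I repeat here for convenience:</p>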
<p><img height="255" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308003.gif" width="372" border="0" /></p>
<p>Note that node <strong><font color="#0000ff">3</font></strong> (&#8220;C&#8221;) sits in column <font color="#0000ff"><strong>3</strong></font> of the table, and all <strong><font color="#0000ff">edges</font></strong> that reach this node are exactly the elements of that column (here the two elements &#8220;A@C&#8221; and &#8220;B@C&#8221;). The PreNodes that form these two edges with node <strong><font color="#0000ff">3</font></strong> (&#8220;C&#8221;) are precisely the <strong><font color="#ff0000">row numbers</font></strong> of the two elements: node <strong><font color="#ff0000">1</font></strong> (&#8220;A&#8221;) and node <strong><font color="#ff0000">2</font></strong> (&#8220;B&#8221;). This correspondence gives us a convenient way to retrieve all incoming edges. Keep it in mind when reading the code above.</p>
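<p>The row and column relationship can be illustrated with the following C++ sketch. It is an assumption-laden stand-in, not ColumnFirstDynamicArray itself: <code>Cell</code>, <code>columnFirst</code>, <code>preNodesOf</code> and <code>exampleCells</code> are hypothetical names, and a sorted vector replaces the linked structure.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One element of the table: the edge from node row to node col.
struct Cell { int row, col; double weight; std::string word; };

// Column-first ordering: by column, then by row (the order of the listing above).
bool columnFirst(const Cell& a, const Cell& b) {
    return a.col != b.col ? a.col < b.col : a.row < b.row;
}

// Row numbers of all cells in column `col`, i.e. the PreNodes of node `col`.
// `sorted` must already be in column-first order.
std::vector<int> preNodesOf(const std::vector<Cell>& sorted, int col) {
    // Jump to the first cell whose column is not smaller than `col`.
    auto it = std::lower_bound(sorted.begin(), sorted.end(), col,
                               [](const Cell& c, int wanted) { return c.col < wanted; });
    std::vector<int> rows;
    for (; it != sorted.end() && it->col == col; ++it)
        rows.push_back(it->row);
    return rows;
}

// The ten cells of the example, deliberately given out of order and then sorted.
std::vector<Cell> exampleCells() {
    std::vector<Cell> cells = {
        {5, 6, 1, "E@末"}, {3, 6, 2, "C@末"}, {4, 6, 3, "D@末"}, {4, 5, 1, "D@E"},
        {3, 4, 1, "C@D"}, {2, 4, 1, "B@D"}, {2, 3, 1, "B@C"}, {1, 3, 2, "A@C"},
        {1, 2, 1, "A@B"}, {0, 1, 1, "始@A"}};
    std::sort(cells.begin(), cells.end(), columnFirst);
    return cells;
}
```

<p>Here <code>preNodesOf(exampleCells(), 3)</code> should return rows 1 and 2, the two PreNodes of node 3 (&#8220;C&#8221;). Because the cells of one column are stored next to each other, the lookup is a single search followed by a short scan, which is the convenience the paragraph above refers to.</p>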
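<h3>3. Producing the Shortest Paths</h3>
<p>Once the shortest-path PreNodes of every node are known, the complete shortest paths have to be derived from them. The original ICTCLAS code does this in the GetPaths method, but to this day I have not understood what that code is trying to do. All I know is that it uses several while loops, several if statements and several levels of nesting. (The ICTCLAS GetPaths is reproduced below; if anyone understands it, please explain it to me. I suspect it is close to my own algorithm.)</p>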
<div class="code">
<div class="title">
<div style="clear: none">The GetPaths method in NShortPath.cpp</div>
</div>
<div class="content"><span style="color: #0000ff">void</span> CNShortPath::GetPaths(unsigned <span style="color: #0000ff">int</span> nNode, unsigned <span style="color: #0000ff">int</span> nIndex, <span style="color: #0000ff">int</span> <br />
&nbsp; **nResult, <span style="color: #0000ff">bool</span> bBest) <br />
{ <br />
&nbsp; CQueue queResult; <br />
&nbsp; unsigned <span style="color: #0000ff">int</span> nCurNode, nCurIndex, nParentNode, nParentIndex, nResultIndex = 0; <br />
<br />
&nbsp; <span style="color: #0000ff">if</span> (m_nResultCount &gt;= MAX_SEGMENT_NUM) <br />
&nbsp; <span style="color: #008000">//Only need 10 result </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> ; <br />
&nbsp; nResult[m_nResultCount][nResultIndex] =&nbsp; - 1; <span style="color: #008000">//Init the result&nbsp; </span><br />
&nbsp; queResult.Push(nNode, nIndex); <br />
&nbsp; nCurNode = nNode; <br />
&nbsp; nCurIndex = nIndex; <br />
&nbsp; <span style="color: #0000ff">bool</span> bFirstGet; <br />
&nbsp; <span style="color: #0000ff">while</span> (!queResult.IsEmpty()) <br />
&nbsp; { <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (nCurNode &gt; 0) <br />
&nbsp;&nbsp;&nbsp; <span style="color: #008000">// </span><br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Get its parent and store them in nParentNode,nParentIndex </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (m_pParent[nCurNode - 1][nCurIndex].Pop(&amp;nParentNode, &amp;nParentIndex, 0, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">false</span>, <span style="color: #0000ff">true</span>) !=&nbsp; - 1) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCurNode = nParentNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCurIndex = nParentIndex; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nCurNode &gt; 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queResult.Push(nCurNode, nCurIndex); <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nCurNode == 0) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Get a path and output </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResult[m_nResultCount][nResultIndex++] = nCurNode; <span style="color: #008000">//Get the first node </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bFirstGet = <span style="color: #0000ff">true</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nParentNode = nCurNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (queResult.Pop(&amp;nCurNode, &amp;nCurIndex, 0, <span style="color: #0000ff">false</span>, bFirstGet) !=&nbsp; - 1) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResult[m_nResultCount][nResultIndex++] = nCurNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bFirstGet = <span style="color: #0000ff">false</span>; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nParentNode = nCurNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResult[m_nResultCount][nResultIndex] =&nbsp; - 1; <span style="color: #008000">//Set the end </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_nResultCount += 1; <span style="color: #008000">//The number of result add by 1 </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (m_nResultCount &gt;= MAX_SEGMENT_NUM) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Only need 10 result </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> ; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResultIndex = 0; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nResult[m_nResultCount][nResultIndex] =&nbsp; - 1; <span style="color: #008000">//Init the result&nbsp; </span><br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (bBest) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">//Return the best result, ignore others </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">return</span> ; <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; queResult.Pop(&amp;nCurNode, &amp;nCurIndex, 0, <span style="color: #0000ff">false</span>, <span style="color: #0000ff">true</span>); <span style="color: #008000">//Read the top node </span><br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (queResult.IsEmpty() == <span style="color: #0000ff">false</span> &amp;&amp; (m_pParent[nCurNode - <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1][nCurIndex].IsSingle() || m_pParent[nCurNode - 1][nCurIndex].IsEmpty <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (<span style="color: #0000ff">true</span>))) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queResult.Pop(&amp;nCurNode, &amp;nCurIndex, 0); <span style="color: #008000">//Get rid of it </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queResult.Pop(&amp;nCurNode, &amp;nCurIndex, 0, <span style="color: #0000ff">false</span>, <span style="color: #0000ff">true</span>); <span style="color: #008000">//Read the top node </span><br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (queResult.IsEmpty() == <span style="color: #0000ff">false</span> &amp;&amp; m_pParent[nCurNode - <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1][nCurIndex].IsEmpty(<span style="color: #0000ff">true</span>) == <span style="color: #0000ff">false</span>) <br />
&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m_pParent[nCurNode - 1][nCurIndex].Pop(&amp;nParentNode, &amp;nParentIndex, 0, <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">false</span>, <span style="color: #0000ff">false</span>); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCurNode = nParentNode; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nCurIndex = nParentIndex; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">if</span> (nCurNode &gt; 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; queResult.Push(nCurNode, nCurIndex); <br />
&nbsp;&nbsp;&nbsp; } <br />
&nbsp; } <br />
}</div>
</div>
<p>I rewrote the method that produces the shortest paths. The algorithm is as follows:</p>
<p><img height="199" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308005.gif" width="516" border="0" /></p>
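<p>(Figure 4)</p>
<p>1) First push the last element onto the stack (node 6 in this example). The whole task ends when this element is popped.</p>
<p>2) Each node's PreNode queue keeps a current pointer, which initially points to the first element of the queue.</p>
<p>3) Working from right to left, take the current element of each PreNode queue and push it onto the stack, resetting the queue pointer to the first element of the queue. As in Figure 4: the PreNode of element 6 is 3, the PreNode of element 3 is 1, and the PreNode of element 1 is 0.</p>
<p>4) Once the first element (0) has been pushed, the content of the stack is one complete path. In this example 0, 1, 3, 6 is a shortest path.</p>
<p>5) Pop the elements off the stack one by one. Each time an element is popped, advance by one position the pointer of the PreNode queue it was taken from. If the pointer is already at the end and cannot advance, repeat step 5; if it can advance, go to step 3.</p>
<p>In this example &#8220;0&#8221; is popped first. It came from the PreNode queue of node 1 (&#8220;A&#8221;), whose current pointer cannot advance any further, so &#8220;1&#8221; is popped next. It came from node 3 (&#8220;C&#8221;), so the PreNode queue pointer of node 3 is advanced. Since it can advance, the 2 in that queue is pushed onto the stack. The PreNode of node 2 (&#8220;B&#8221;) is 1, so 1 is pushed as well, and so on until 0 is pushed. This yields another shortest path, 0, 1, 2, 3, 6, as shown below:</p>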
<p><img height="196" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308006.gif" width="512" border="0" /></p>
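<p>(Figure 5)</p>
<p>Continuing, 0, 1 and 2 are popped. After 3 is popped, the PreNode queue pointer of element 6, from which it came, can still advance, so 5 is pushed, followed by its PreNodes in turn until 0 is pushed. This outputs the third shortest path, 0, 1, 2, 4, 5, 6, as shown below:</p>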
<p><img height="195" alt="" src="http://www.cnblogs.com/images/cnblogs_com/zhenyulu/200701/0308007.gif" width="512" border="0" /></p>
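<p>(Figure 6)</p>
<p>After this output the stack is unwound again. This time no element on the stack has a PreNode queue pointer that can still advance, so the last element, 6, is popped as well and the output is complete. We obtain three shortest paths:</p>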
<ul>
    <li><font color="#0000ff">0, 1, 3, 6</font> </li>
    <li><font color="#0000ff">0, 1, 2, 3, 6</font> </li>
    <li><font color="#0000ff">0, 1, 2, 4, 5, 6</font> </li>
</ul>
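<p>The five steps above can be sketched in C++ as follows. This is only an illustrative sketch under stated assumptions, not the SharpICTCLAS or ICTCLAS code: the function name <code>enumeratePaths</code> and the <code>pre</code> table are hypothetical, the predecessor lists used in the example are inferred from the three paths listed above, and each queue pointer is kept in the stack frame rather than in the queue itself (pushing a frame with pointer 0 plays the role of resetting the pointer).</p>

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: pre[v] lists the predecessors (the "PreNode queue") of
// node v; pre[0] is empty. Assumes every node reached from the last node,
// other than node 0, has at least one predecessor.
// Returns every path from node 0 to the last node, listed from 0 to the end.
std::vector<std::vector<int>> enumeratePaths(
    const std::vector<std::vector<int>>& pre) {
  std::vector<std::vector<int>> result;
  if (pre.empty()) return result;
  const int last = static_cast<int>(pre.size()) - 1;

  // Each frame is (node, current pointer into pre[node]).
  std::vector<std::pair<int, std::size_t>> stack;
  stack.emplace_back(last, std::size_t{0});  // step 1: push the last node

  while (!stack.empty()) {
    // Step 3: follow the current pointers leftwards until node 0 is pushed.
    while (stack.back().first != 0) {
      const int node = stack.back().first;
      const std::size_t pos = stack.back().second;
      assert(!pre[node].empty());
      stack.emplace_back(pre[node][pos], std::size_t{0});
    }

    // Step 4: the stack, read from top to bottom, is one path.
    std::vector<int> path;
    for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
      path.push_back(it->first);
    }
    result.push_back(path);

    // Step 5: pop node 0, then keep popping until the frame on top of the
    // stack can advance its pointer; an empty stack ends the whole task.
    stack.pop_back();
    while (!stack.empty()) {
      std::pair<int, std::size_t>& top = stack.back();
      if (top.second + 1 < pre[top.first].size()) {
        ++top.second;
        break;
      }
      stack.pop_back();
    }
  }
  return result;
}
```

<p>With the assumed table <code>{{}, {0}, {1}, {1, 2}, {2}, {4}, {3, 5}}</code> for nodes 0 to 6, the sketch returns the three paths above in the same order: 0, 1, 3, 6; then 0, 1, 2, 3, 6; then 0, 1, 2, 4, 5, 6.</p>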
<p>Let us look at how SharpICTCLAS implements this algorithm:</p>
<div class="code">
<div class="title">
<div style="clear: none">The GetPaths method in SharpICTCLAS</div>
</div>
<div class="content"><span style="color: #008000">//==================================================================== </span><br />
<span style="color: #008000">// Note: index = 0: the shortest paths; index = 1: the second-shortest paths, </span><br />
<span style="color: #008000">//&nbsp;&nbsp;&nbsp;&nbsp; and so on. index &lt;= this.m_nValueKind </span><br />
<span style="color: #008000">//==================================================================== </span><br />
<span style="color: #0000ff">public</span> List&lt;<span style="color: #0000ff">int</span>[]&gt; GetPaths(<span style="color: #0000ff">int</span> index) <br />
{ <br />
&nbsp;&nbsp; Stack&lt;PathNode&gt; stack = <span style="color: #0000ff">new</span> Stack&lt;PathNode&gt;(); <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span> curNode = m_nNode - 1, curIndex = index; <br />
&nbsp;&nbsp; QueueElement element; <br />
&nbsp;&nbsp; PathNode node; <br />
&nbsp;&nbsp; <span style="color: #0000ff">int</span>[] aPath; <br />
&nbsp;&nbsp; List&lt;<span style="color: #0000ff">int</span>[]&gt; result = <span style="color: #0000ff">new</span> List&lt;<span style="color: #0000ff">int</span>[]&gt;(); <br />
<br />
&nbsp;&nbsp; element = m_pParent[curNode - 1][curIndex].GetFirst(); <br />
&nbsp;&nbsp; <span style="color: #0000ff">while</span> (element != <span style="color: #0000ff">null</span>) <br />
&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// ---------- 通过压栈得到路径 ----------- </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; stack.Push(<span style="color: #0000ff">new</span> PathNode(curNode, curIndex)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; stack.Push(<span style="color: #0000ff">new</span> PathNode(element.nParent, element.nIndex)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curNode = element.nParent; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">while</span> (curNode != 0) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; element = m_pParent[element.nParent - 1][element.nIndex].GetFirst(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; stack.Push(<span style="color: #0000ff">new</span> PathNode(element.nParent, element.nIndex)); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curNode = element.nParent; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// -------------- 输出路径 -------------- </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; PathNode[] nArray = stack.ToArray();&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; aPath = <span style="color: #0000ff">new</span> <span style="color: #0000ff">int</span>[nArray.Length]; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">for</span>(<span style="color: #0000ff">int</span> i=0; i&lt;aPath.Length; i++) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; aPath[i] = nArray[i].nParent; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; result.Add(aPath); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #008000">// -------------- 出栈以检查是否还有其它路径 -------------- </span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span style="color: #0000ff">do</span> <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; { <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; node = stack.Pop(); <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curNode = node.nParent; <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curIndex = node.nIndex; <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } <span style="color: #0000ff">while</span> (curNode &lt; 1 || (stack.Count != 0 &amp;&amp; !m_pParent[curNode - 1][curIndex].CanGetNext)); <br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; element = m_pParent[curNode - 1][curIndex].GetNext(); <br />
&nbsp;&nbsp; } <br />
<br />
&nbsp;&nbsp; <span style="color: #0000ff">return</span> result; <br />
}</div>
</div>
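<p>Note that the code above handles N-shortest paths, which makes it somewhat more complex than the 1-shortest-path case, although the overall structure is unchanged. It reduces the 70-odd lines of path-retrieval code in the original ICTCLAS to just over 40 lines.</p>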
<ul>
    <li><font color="#800080"><strong>Summary</strong></font> </li>
</ul>
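<p>1) Solving N-shortest paths is fairly complex, so this article starts with the 1-shortest path to explain how SharpICTCLAS performs the computation; the next article generalizes it to N-shortest paths.</p>
<p>2) The 1-shortest path does not mean there is only one shortest path; it means all the paths that share the minimum length. In this article's example, the 1-shortest-path algorithm found three paths, each of length 5, so all of them are shortest paths.<br />
<br />
Source: http://www.cnblogs.com/zhenyulu/category/85598.html</p>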
<br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/jiangyz/" target="_blank">刀剑笑</a> 2007-12-28 19:38</div>]]></description></item></channel></rss>