When I use ICTCLAS.dll to segment web page content, I get the following exception:
An unexpected exception has been detected in native code outside the VM.
Unexpected Signal : EXCEPTION_ACCESS_VIOLATION (0xc0000005) occurred at PC=0x18473252
Function=Ordinal5+0x3252
Library=D:\workspace3\Lucene_191\ICTCLAS.dll

Current Java thread:
at com.xjt.nlp.word.ICTCLAS.paragraphProcess(Native Method)
- locked <0x1003db78> (a com.xjt.nlp.word.ICTCLAS)
at org.apache.lucene.analysis.cn.TjuChineseAnalyzer.tokenStream(TjuChineseAnalyzer.java:59)
at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:162)
at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:93)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:450)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:436)
at ch14.performance.index.IndexPerformanceTest.addDocument(IndexPerformanceTest.java:192)
at ch14.performance.index.IndexPerformanceTest.indexFiles(IndexPerformanceTest.java:214)
at ch14.performance.index.IndexPerformanceTest.indexFiles(IndexPerformanceTest.java:207)
at ch14.performance.index.IndexPerformanceTest.indexFiles(IndexPerformanceTest.java:207)
at ch14.performance.index.IndexPerformanceTest.toIndex(IndexPerformanceTest.java:141)
at ch14.performance.index.IndexPerformanceTest.toIndex(IndexPerformanceTest.java:127)
at ch14.performance.index.IndexPerformanceTest.main(IndexPerformanceTest.java:228)

Dynamic libraries:
0x00400000 - 0x00407000 C:\j2sdk1.4.2_03\bin\javaw.exe
0x77F80000 - 0x77FFC000 C:\WINNT\system32\ntdll.dll
0x796D0000 - 0x79735000 C:\WINNT\system32\ADVAPI32.dll
0x77E60000 - 0x77F32000 C:\WINNT\system32\KERNEL32.dll
0x786F0000 - 0x78768000 C:\WINNT\system32\RPCRT4.dll
0x77DF0000 - 0x77E59000 C:\WINNT\system32\USER32.dll
0x77F40000 - 0x77F7C000 C:\WINNT\system32\GDI32.dll
0x78000000 - 0x78045000 C:\WINNT\system32\MSVCRT.dll
0x75E00000 - 0x75E1A000 C:\WINNT\system32\IMM32.DLL
0x6C330000 - 0x6C338000 C:\WINNT\system32\LPK.DLL
0x65D20000 - 0x65D74000 C:\WINNT\system32\USP10.dll
0x08000000 - 0x08138000 C:\j2sdk1.4.2_03\jre\bin\client\jvm.dll
0x77530000 - 0x77560000 C:\WINNT\system32\WINMM.dll
0x6BD00000 - 0x6BD0D000 C:\WINNT\system32\SYNCOR11.DLL
0x10000000 - 0x10007000 C:\j2sdk1.4.2_03\jre\bin\hpi.dll
0x007F0000 - 0x007FE000 C:\j2sdk1.4.2_03\jre\bin\verify.dll
0x00800000 - 0x00819000 C:\j2sdk1.4.2_03\jre\bin\java.dll
0x00820000 - 0x0082D000 C:\j2sdk1.4.2_03\jre\bin\zip.dll
0x18470000 - 0x1853F000 D:\workspace3\Lucene_191\ICTCLAS.dll
0x777C0000 - 0x777DE000 C:\WINNT\system32\WINSPOOL.DRV
0x79B20000 - 0x79B31000 C:\WINNT\system32\MPR.DLL
0x71710000 - 0x71794000 C:\WINNT\system32\COMCTL32.dll
0x77900000 - 0x77923000 C:\WINNT\system32\imagehlp.dll
0x72960000 - 0x7298D000 C:\WINNT\system32\DBGHELP.dll
0x687E0000 - 0x687EB000 C:\WINNT\system32\PSAPI.DLL

Heap at VM Abort:
Heap
def new generation total 576K, used 429K [0x10010000, 0x100b0000, 0x104f0000)
eden space 512K, 74% used [0x10010000, 0x1006f0f8, 0x10090000)
from space 64K, 77% used [0x100a0000, 0x100ac5e0, 0x100b0000)
to space 64K, 0% used [0x10090000, 0x10090000, 0x100a0000)
tenured generation total 1408K, used 128K [0x104f0000, 0x10650000, 0x14010000)
the space 1408K, 9% used [0x104f0000, 0x105103e8, 0x10510400, 0x10650000)
compacting perm gen total 4096K, used 1782K [0x14010000, 0x14410000, 0x18010000)
the space 4096K, 43% used [0x14010000, 0x141cd860, 0x141cda00, 0x14410000)

Local Time = Wed May 10 19:19:16 2006
Elapsed Time = 8
#
# The exception above was detected in native code outside the VM
#
# Java VM: Java HotSpot(TM) Client VM (1.4.2_03-b02 mixed mode)
#
# An error report file has been saved as hs_err_pid2120.log.
# Please refer to the file for further information.
#

Has the author or anyone else here run into this? Other people online have reported the same problem.

Good question.
This is a bug in the Chinese Academy of Sciences segmenter (ICTCLAS).

For example, segmenting either of the following strings with this tool is guaranteed to crash:

“5/”
“6/”

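Any input of this shape, a digit followed by a "/", triggers the problem. The access violation is raised inside the DLL, so the JVM cannot catch it as a Java exception, and the whole JVM is forcibly terminated.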
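Here is a minimal sketch of a guard for that pattern, so the risky input never reaches the native call. The class and method names (`SlashGuard`, `looksRisky`, `defuse`) are made up for illustration, and it is only an assumption that separating the digit from the slash avoids the crash in a given ICTCLAS build. I have not verified that against the DLL.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of ICTCLAS or Lucene.
final class SlashGuard {
    // A half-width or full-width digit immediately followed by a half-width or full-width slash.
    private static final Pattern DIGIT_SLASH =
            Pattern.compile("([0-9\\uFF10-\\uFF19])([/\\uFF0F])");

    /** Returns true if the text contains a digit directly followed by a slash. */
    static boolean looksRisky(String text) {
        return text != null && DIGIT_SLASH.matcher(text).find();
    }

    /**
     * Inserts a space between the digit and the slash.
     * Assumption: breaking up the pair is enough to sidestep the bug.
     */
    static String defuse(String text) {
        return DIGIT_SLASH.matcher(text).replaceAll("$1 $2");
    }
}
```

You could call `SlashGuard.looksRisky(text)` right before the text is handed to `paragraphProcess`, and then either skip that text or pass `SlashGuard.defuse(text)` instead.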

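My suggestion is to preprocess each sentence before segmenting it, for example:
1) Convert full-width characters to half-width
2) Remove redundant spaces
3) Keep each sentence passed to the segmenter reasonably short
4) Convert special symbols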
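The four steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: the class name `SegmenterPreprocessor`, the 200-character limit, and the choice to turn "/" into a space are mine, not something ICTCLAS documents. The code uses generics and `StringBuilder`, so it needs Java 5 or later. On the 1.4.2 JVM shown in the log you would swap in `StringBuffer` and a raw `List`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical preprocessing helper, not part of ICTCLAS or Lucene.
final class SegmenterPreprocessor {
    // Assumed limit; the advice above only says sentences should not be too long.
    private static final int MAX_CHUNK = 200;

    /** Step 1: map full-width forms (U+FF01..U+FF5E) and the ideographic space to half-width. */
    static String toHalfWidth(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u3000') {
                sb.append(' ');
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Step 2: collapse runs of whitespace into one space and trim the ends. */
    static String collapseSpaces(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    /** Step 4: replace "/" with a space (an illustrative choice of "special symbol"). */
    static String replaceSpecial(String s) {
        return s.replace('/', ' ');
    }

    /** Step 3: cut into chunks of at most max chars, preferring to break after sentence punctuation. */
    static List<String> split(String s, int max) {
        List<String> out = new ArrayList<String>();
        int start = 0;
        while (start < s.length()) {
            int end = Math.min(start + max, s.length());
            if (end < s.length()) {
                // Walk back to the nearest sentence-ending mark inside the window, if any.
                for (int i = end - 1; i > start; i--) {
                    char c = s.charAt(i);
                    if (c == '\u3002' || c == '!' || c == '?' || c == ';') {
                        end = i + 1;
                        break;
                    }
                }
            }
            out.add(s.substring(start, end));
            start = end;
        }
        return out;
    }

    /** Runs all steps; symbols are replaced before spaces are collapsed so no double spaces remain. */
    static List<String> prepare(String raw) {
        String s = toHalfWidth(raw);
        s = replaceSpecial(s);
        s = collapseSpaces(s);
        return split(s, MAX_CHUNK);
    }
}
```

Each string returned by `prepare` would then be segmented separately. Note that replacing "/" outright loses information in things like dates and URLs, so you may prefer a different substitution for your data.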


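That said, this bug seems to have been fixed in newer versions of the segmenter.
I also have a Java segmenter modeled on the CAS one, written by a junior labmate of mine. He converted the dictionary as well, so it can use a TXT file as its lexicon and lets you add new words. I hope to get a chance to share it with everyone :) He told me he plans to put it on SourceForge, so people will be able to download it then :)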

If possible, I would also like to include it in my next Lucene book, haha. Of course I need to ask him first.