DANCE WITH JAVA

Building high-quality systems

Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML)

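Before you can search, you have to index. The simplest case is indexing txt files, which the previous post covered. This post covers indexing documents in some other formats, such as MS Word, PDF and RTF.
The approach: convert each document to plain text first, then index that text, so the key step is the conversion. Fortunately the Java world has plenty of open-source projects, and many of them can be used as they are. They are introduced one by one below.
A note before anything else: in all the examples below, the is parameter is an InputStream for the file being indexed.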
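The "convert to plain text first, then index" idea can be sketched as a small lookup that picks a converter by file extension. This is only an illustration, not code from any of the libraries discussed in this post: TextExtractor, ExtractorRegistry and readPlainText are hypothetical names, and the txt converter assumes UTF-8 input and Java 9 or later (for InputStream.readAllBytes).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical interface: turns one document stream into plain text. */
interface TextExtractor {
    String extract(InputStream is) throws IOException;
}

/** Hypothetical registry that picks an extractor by file extension. */
class ExtractorRegistry {
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the extractor for the file name, or null if the format is unknown. */
    TextExtractor forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension to go by
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return byExtension.get(ext);
    }
}

class ExtractorRegistryDemo {
    /** Plain text needs no conversion: just decode the bytes (UTF-8 assumed). */
    static String readPlainText(InputStream is) throws IOException {
        return new String(is.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("txt", ExtractorRegistryDemo::readPlainText);
        // Word, PDF and RTF converters would be registered the same way.

        TextExtractor extractor = registry.forFile("notes.TXT");
        InputStream is = new ByteArrayInputStream(
                "hello lucene".getBytes(StandardCharsets.UTF_8));
        String bodyText = extractor.extract(is);
        System.out.println(bodyText); // this string is what gets handed to the indexer
    }
}
```

Whatever the source format, the indexer only ever sees bodyText, which is why the rest of this post is about conversion rather than about Lucene itself.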
Word documents:
Open-source projects that can convert a Word document to plain text: POI or TextMining.
Using POI:

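// import org.apache.poi.hdf.extractor.WordDocument;  (POI's legacy HDF API)
// import java.io.PrintWriter; import java.io.StringWriter;
WordDocument wd = new WordDocument(is);
StringWriter docTextWriter = new StringWriter();
// Write all of the document's text into the in-memory writer
wd.writeAllText(new PrintWriter(docTextWriter));
docTextWriter.close();
String bodyText = docTextWriter.toString();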

TextMining is even simpler to use:

// import org.textmining.text.extraction.WordExtractor;
String bodyText = new WordExtractor().extractText(is);

PDF documents:
The library to use for converting PDF documents is PDFBox.

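// Legacy PDFBox API (org.pdfbox.* packages), as used at the time of writing
PDFParser parser = new PDFParser(is);
parser.parse();
COSDocument cosDoc = parser.getDocument();
String docText;
try {
    if (cosDoc.isEncrypted()) {
        // password: the document's password, supplied by the caller
        DecryptDocument decryptor = new DecryptDocument(cosDoc);
        decryptor.decryptDocument(password);
    }
    PDFTextStripper stripper = new PDFTextStripper();
    docText = stripper.getText(new PDDocument(cosDoc));
} finally {
    cosDoc.close(); // release the resources held by the parsed document
}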

RTF documents:
RTF conversion is already available in the JDK, in the javax.swing.text.rtf package.

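// import javax.swing.text.DefaultStyledDocument;
// import javax.swing.text.rtf.RTFEditorKit;
DefaultStyledDocument styledDoc = new DefaultStyledDocument();
new RTFEditorKit().read(is, styledDoc, 0);   // parse the RTF into the document
String bodyText = styledDoc.getText(0, styledDoc.getLength());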
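For reference, here is the RTF case as a complete class that runs with nothing but the JDK. It is a minimal sketch: the class name RtfTextDemo, the method name extractRtfText and the sample RTF string are assumptions made for the example, and wrapping the checked exceptions in IllegalStateException is just one possible choice.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

class RtfTextDemo {
    /** Extracts plain text from an RTF stream using the JDK's Swing text classes. */
    static String extractRtfText(InputStream is) {
        try {
            DefaultStyledDocument styledDoc = new DefaultStyledDocument();
            // Parse the RTF markup into the styled document, starting at offset 0
            new RTFEditorKit().read(is, styledDoc, 0);
            // Ask the document for its text only; the styling is discarded
            return styledDoc.getText(0, styledDoc.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not read RTF content", e);
        }
    }

    public static void main(String[] args) {
        // A tiny hand-written RTF document, used only for this demo
        byte[] rtf = "{\\rtf1\\ansi Hello RTF}".getBytes(StandardCharsets.US_ASCII);
        String bodyText = extractRtfText(new ByteArrayInputStream(rtf));
        System.out.println(bodyText.trim());
    }
}
```

In an indexing loop, the caller can catch the IllegalStateException and skip the file, so one damaged document does not stop the whole run.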

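That is all it takes to index text in these formats.

HTML and XML are handled the same way.
The difference is that the library to use for HTML is JTidy,
and for XML the options are SAX and Digester.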
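Of these, SAX ships with the JDK, so the XML case can be sketched without extra jars. The handler below simply collects character data and drops the tags. XmlTextDemo and extractXmlText are hypothetical names, and a real indexer would probably map specific elements to separate Lucene fields rather than flatten everything into one string.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XmlTextDemo {
    /** Collects all character data from an XML stream, dropping the tags. */
    static String extractXmlText(InputStream is) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' '); // keep words from adjacent elements apart
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(is, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML content", e);
        }
        // Collapse the whitespace left over from indentation and element ends
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Lucene</title><body>index me</body></doc>";
        String bodyText = extractXmlText(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(bodyText); // prints "Lucene index me"
    }
}
```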

posted on 2007-06-14 13:27 dreamstone Views (8058) Comments (9) Category: Search engines, Lucene

# re: Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML) 2007-08-08 17:31 123

Chinese is not supported......

# re: Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML) 2008-01-03 18:10 qq307059755

Is there a mistake in the code above?
When execution reaches
String docText = stripper.getText(new PDDocument(cosDoc));
it fails with:
Indexing file: C:\word\source\基于内容的图像检索研究进展.pdf

Exception in thread "AWT-EventQueue-0" java.lang.NoClassDefFoundError: org/fontbox/cmap/CMapParser
    at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:534)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:412)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
    at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at textsearch.TextSearchFrame.indexFile(TextSearchFrame.java:235)
..........

# re: Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML) 2008-01-03 19:57 qq307059755

Problem solved: the downloaded pdfbox package includes the org.fontbox.cmap.CMapParser class.

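# re: Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML) 2008-07-01 11:33 wrj

You need to include more jars. Take the class named in the error and search the package's whole directory for it. Searching for CMapParser, for example, turns up a jar; add that jar to your application and it works!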

# re: Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML) 2008-07-23 17:18 大

Really? None of the pdfbox packages I downloaded contain that class. Is that really the cause? @qq307059755

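# re: Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML) [not logged in] 2008-09-09 11:19 hp

I can't find the org.fontbox.cmap.CMapParser class either. What is going on? Could an expert give me some pointers? I get the same error.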

# re: Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML) 2010-02-25 11:10 shusnail

I get this error too.

# re: Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML) 2011-07-29 11:04 拉面tina

How do I search Chinese text? I'm using lucene2.0.0, O(∩_∩)O

# re: Indexing non-txt documents with Lucene (PDF, Word, RTF, HTML, XML) 2014-03-13 22:13 第三方第三方

打发第三方的说法是打发的撒
