
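    1. Using jacob
    jacob is a bridge: middleware that connects Java to COM or Win32 functions. jacob cannot extract Word, Excel, and similar files by itself; that takes a native DLL. You do not have to write one, though, because the author of jacob ships it together with the library.
    Download jacob:
http://www.matrix.org.cn/down_view.asp?id=13
    After downloading jacob and putting the files in place (the DLL on the PATH, the jar on the classpath), you can write your own extraction program. Here is an example: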

import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public class FileExtracter
{
       public static void main(String[] args)
       {
              // Start Word through COM; this needs Microsoft Word installed on Windows.
              ActiveXComponent app = new ActiveXComponent("Word.Application");
              String inFile = "c:\\test.doc";
              String tpFile = "c:\\temp.htm";
              try
              {
                     app.setProperty("Visible", new Variant(false));
                     // Word's collection of open documents is named "Documents".
                     Dispatch docs = app.getProperty("Documents").toDispatch();
                     // Open(FileName, ConfirmConversions = false, ReadOnly = true)
                     Dispatch doc = Dispatch
                                   .invoke(docs, "Open", Dispatch.Method, new Object[]
                                   {inFile, new Variant(false), new Variant(true)}, new int[1])
                                   .toDispatch();
                     // SaveAs with file format 8 (wdFormatHTML) writes the document as HTML.
                     Dispatch.invoke(doc, "SaveAs", Dispatch.Method, new Object[]
                     {tpFile, new Variant(8)}, new int[1]);
                     // Close the document without saving changes.
                     Dispatch.call(doc, "Close", new Variant(false));
              } catch (Exception e)
              {
                     e.printStackTrace();
              } finally
              {
                     // Always shut Word down, otherwise a hidden WINWORD process stays behind.
                     app.invoke("Quit", new Variant[]
                     {});
              }
       }
}
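
The jacob example stops once Word has written the HTML file (tpFile). If what you want in the end is plain text, one possible follow-up step is to strip the markup from that HTML. The sketch below is only an illustration under assumptions: HtmlText and toPlainText are hypothetical names that are not part of jacob, the regex-based stripping is crude (a real HTML parser is more robust), and only a handful of entities are decoded.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jacob: crude HTML-to-text conversion.
final class HtmlText {
    // Whole <script>...</script> and <style>...</style> blocks, contents included.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        // Replace tags with a space so that adjacent paragraphs do not run together.
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last so that
        // "&amp;lt;" ends up as "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<html><body><p>Hello</p><p>Word</p></body></html>"));
    }
}
```

To use it on the file produced above, read c:\temp.htm into a String first, using the charset declared inside the HTML, and pass that String to toPlainText.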




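    2. Using Apache POI to extract Word and Excel
    POI is an Apache project. Using POI directly can still feel tedious, but that does not matter, because a simpler wrapper interface is available:
    Download the wrapped POI package:
http://www.matrix.org.cn/down_view.asp?id=14
    After downloading it, put it on your classpath. Here is an example of how to use it: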
       

import java.io.*;
import org.textmining.text.extraction.WordExtractor;

/**
 * <p>
 * Title: pdf extraction
 * </p>
 * <p>
 * Description: email:chris@matrix.org.cn
 * </p>
 * <p>
 * Copyright: Matrix Copyright (c) 2003
 * </p>
 * <p>
 * Company: Matrix.org.cn
 * </p>
 *
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */

public class PdfExtractor
{
       public PdfExtractor()
       {
       }

       public static void main(String args[]) throws Exception
       {
              // Despite the class name, this example extracts text from a Word file.
              FileInputStream in = new FileInputStream("c:\\a.doc");
              try
              {
                     WordExtractor extractor = new WordExtractor();
                     String str = extractor.extractText(in);
                     System.out.println("the result length is " + str.length());
                     System.out.println("the result is " + str);
              } finally
              {
                     // Release the file handle even if extraction fails.
                     in.close();
              }
       }
}
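
WordExtractor expects a real Word binary file, and a file that merely has a .doc extension may be something else. Word and Excel binary files are OLE2 compound documents, which start with a fixed 8-byte signature, so a cheap check before extraction is possible. The sketch below is an assumption-laden illustration: Ole2Sniffer and looksLikeOle2 are hypothetical names that belong to neither POI nor the wrapper package, and the check cannot tell a .doc from an .xls because both share the same container format.

```java
// Hypothetical helper, not part of POI or the textmining wrapper.
final class Ole2Sniffer {
    // The 8-byte signature found at the start of an OLE2 compound document.
    private static final int[] MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    /** Returns true if the given leading bytes of a file carry the OLE2 signature. */
    static boolean looksLikeOle2(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((header[i] & 0xFF) != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        System.out.println(looksLikeOle2(pdfHeader));
    }
}
```

In practice you would read the first 8 bytes of the file into an array, call looksLikeOle2, and only then open a fresh FileInputStream for the extractor, so that the extractor still sees the file from its first byte.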




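    3. pdfbox: extracting PDF files
   Note that pdfbox's support for Chinese is not good yet. First download pdfbox:

http://www.matrix.org.cn/down_view.asp?id=12
Here is an example of how to use pdfbox to extract a PDF file: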
 

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdfparser.PDFParser;
import java.io.*;
import org.pdfbox.util.PDFTextStripper;

/**
 * <p>
 * Title: pdf extraction
 * </p>
 * <p>
 * Description: email:chris@matrix.org.cn
 * </p>
 * <p>
 * Copyright: Matrix Copyright (c) 2003
 * </p>
 * <p>
 * Company: Matrix.org.cn
 * </p>
 *
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */

public class PdfExtracter
{
       public PdfExtracter()
       {
       }

       public String GetTextFromPdf(String filename) throws Exception
       {
              PDDocument pdfdocument = null;
              FileInputStream is = new FileInputStream(filename);
              try
              {
                     // Parse the PDF into pdfbox's in-memory document model.
                     PDFParser parser = new PDFParser( is );
                     parser.parse();
                     pdfdocument = parser.getPDDocument();
                     // Collect the stripped text in memory.
                     ByteArrayOutputStream out = new ByteArrayOutputStream();
                     OutputStreamWriter writer = new OutputStreamWriter( out );
                     PDFTextStripper stripper = new PDFTextStripper();
                     stripper.writeText(pdfdocument.getDocument(), writer );
                     writer.close();
                     byte[] contents = out.toByteArray();
                     // Encoded and decoded with the same platform default charset.
                     String ts = new String(contents);
                     System.out.println("the string length is" + contents.length + "\n");
                     return ts;
              } finally
              {
                     // Release the parsed document and the underlying file.
                     if (pdfdocument != null)
                     {
                            pdfdocument.close();
                     }
                     is.close();
              }
       }

       public static void main(String args[])
       {
              PdfExtracter pf = new PdfExtracter();
              try
              {
                     String ts = pf.GetTextFromPdf("c:\\a.pdf");
                     System.out.println(ts);
              }
              catch(Exception e)
              {
                     e.printStackTrace();
              }
       }
}
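
The pdfbox example sends the text through an OutputStreamWriter into a byte array and then decodes the bytes back into a String. Since the stripper only needs a Writer, the round trip through bytes (and through the platform default charset) can be avoided by collecting characters directly. The sketch below shows that idea with the standard library only. WriterCapture and TextSource are hypothetical names, and TextSource merely stands in for a call such as stripper.writeText(document, writer); this does not fix pdfbox's own handling of Chinese fonts, it only removes one place where characters could get damaged.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: collects whatever a producer writes, without any byte/charset round trip.
final class WriterCapture {
    /** Stand-in for anything that writes text to a Writer, e.g. a PDF text stripper. */
    interface TextSource {
        void writeTo(Writer out) throws IOException;
    }

    static String capture(TextSource source) {
        // StringWriter stores chars, so no encoding or decoding takes place.
        StringWriter buffer = new StringWriter();
        try {
            source.writeTo(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the word for "Chinese", written with escapes to stay
        // independent of the source file encoding.
        System.out.println(capture(out -> out.write("\u4e2d\u6587 PDF")));
    }
}
```

With pdfbox on the classpath, the lambda body would be the writeText call from the example above, with the StringWriter passed in place of the OutputStreamWriter.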


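     4. xpdf: extracting PDF files that contain Chinese
   xpdf is an open-source project; we can call its native program to extract text from Chinese PDF files.
Download the xpdf package:
http://www.matrix.org.cn/down_view.asp?id=15
You also need to download the patch package that adds Chinese support:
http://www.matrix.org.cn/down_view.asp?id=16
   Install the Chinese patch as described in its readme, and you can start writing the Java program that calls the native program.
Here is an example of how to call it: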

import java.io.*;

/**
 * <p>
 * Title: pdf extraction
 * </p>
 * <p>
 * Description: email:chris@matrix.org.cn
 * </p>
 * <p>
 * Copyright: Matrix Copyright (c) 2003
 * </p>
 * <p>
 * Company: Matrix.org.cn
 * </p>
 *
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */

public class PdfWin
{
       public PdfWin()
       {
       }

       public static void main(String args[]) throws Exception
       {
              String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
              String filename = "c:\\a.pdf";
              // "-enc UTF-8" selects the output encoding, "-q" suppresses messages,
              // and the final "-" sends the extracted text to standard output.
              String[] cmd = new String[]
              {PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-"};
              Process p = Runtime.getRuntime().exec(cmd);
              BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
              InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
              StringWriter out = new StringWriter();
              char[] buf = new char[10000];
              int len;
              while ((len = reader.read(buf)) >= 0)
              {
                     // Append each chunk; buf alone only holds the most recent read.
                     out.write(buf, 0, len);
                     System.out.println("the length is" + len);
              }
              reader.close();
              // Wait for pdftotext to exit before using the result.
              p.waitFor();
              String ts = out.toString();
              System.out.println("the str is" + ts);
       }
}
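
One detail of the xpdf example deserves a closer look: the output of pdftotext arrives in chunks, and the char buffer only ever holds the latest chunk, so the full text exists only if every chunk is appended to a growing result. The sketch below isolates that read loop using the standard library alone. ReaderDrain and readAll are hypothetical names, and the StringReader in main simply stands in for the InputStreamReader wrapped around the process output.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: reads a Reader to the end, accumulating every chunk.
final class ReaderDrain {
    static String readAll(Reader reader, int bufferSize) {
        if (bufferSize <= 0) {
            // A zero-length buffer would make read() return 0 forever.
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        StringBuilder result = new StringBuilder();
        char[] buf = new char[bufferSize];
        try {
            int len;
            // read() returns -1 at end of stream; otherwise the number of chars filled.
            while ((len = reader.read(buf)) >= 0) {
                // Append only the part of buf that this read actually filled.
                result.append(buf, 0, len);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // A deliberately tiny buffer forces several reads.
        System.out.println(readAll(new StringReader("text from pdftotext"), 4));
    }
}
```

Appending buf from 0 to len matters as much as appending at all: after a short final read, the tail of the buffer still contains characters from the previous chunk.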

 
posted on 2006-11-27 10:26 MyBox
