梦幻e家人

java咖啡

August 6, 2008

Lucene Keyword Highlighting


The org.apache.lucene.search.highlight package in Lucene provides utilities for highlighting search keywords. When you search with Baidu or Google, the terms in each result summary that match your query are highlighted; both engines highlight them in red.

With the highlighting utilities Lucene provides, the feature is straightforward to implement.

Highlighting works like this: given the user's query, the search locates the matching result files, a summary text is extracted for each file, and the configured highlight format is then written into the summary around every term that matches (or resembles) the query terms. When the page is rendered, the query-related terms in the summary appear highlighted.

Lucene's org.apache.lucene.search.highlight.SimpleHTMLFormatter class constructs a highlight format. The simplest construction looks like this:

SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");

The constructor is declared as public SimpleHTMLFormatter(String preTag, String postTag). The format is tied to web pages: HTML marks text up with paired tags, hence one preTag and one postTag.

The format constructed above displays the keywords that occur in the summary in red, setting them apart from the rest of the text.
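Conceptually, what the formatter contributes is just the tag pair. A minimal sketch without Lucene (the class and method here are illustrative, not part of the Lucene API) of wrapping every occurrence of a keyword in preTag/postTag:

```java
// Sketch only: wraps each exact occurrence of the keyword in preTag/postTag.
// Lucene's Highlighter does much more (tokenization, scoring, fragmenting),
// but the injected output has this shape.
public class HighlightSketch {
    static String highlight(String text, String keyword, String preTag, String postTag) {
        return text.replace(keyword, preTag + keyword + postTag);
    }

    public static void main(String[] args) {
        System.out.println(highlight("red sky, red sea", "red",
                "<font color='red'>", "</font>"));
        // → <font color='red'>red</font> sky, <font color='red'>red</font> sea
    }
}
```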

From the formatter you then construct an org.apache.lucene.search.highlight.Highlighter instance. It tokenizes the Field text of each search result (here, the summary text), finds the terms that match or resemble the query terms, writes the highlight format into the summary, and returns a new, formatted summary text that renders with highlighting in a web page.

The following simple example walks through the highlighting process.

The test class:

package org.shirdrn.lucene.learn.highlight;

import java.io.IOException;
import java.io.StringReader;

import net.teamhot.lucene.ThesaurusAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class MyHighLighter {

private String indexPath = "F:\\index";
private Analyzer analyzer;
private IndexSearcher searcher;

public MyHighLighter(){
   analyzer = new ThesaurusAnalyzer();
}

public void createIndex() throws IOException {   // builds the index
   IndexWriter writer = new IndexWriter(indexPath,analyzer,true);
   Document docA = new Document();
   String fileTextA = "因为火烧云总是燃烧着消失在太阳冲下地平线的时刻,然后便是宁静的自然的天籁,没有谁会在这样的时光的镜片里伤感自语,因为灿烂给人以安静的舒适感。";
   Field fieldA = new Field("contents", fileTextA, Field.Store.YES,Field.Index.TOKENIZED);
   docA.add(fieldA);
  
   Document docB = new Document();
   String fileTextB = "因为带有以伤痕为代价的美丽风景总是让人不由地惴惴不安,紧接着袭面而来的抑或是病痛抑或是灾难,没有谁会能够安逸着恬然,因为模糊让人撕心裂肺地想呐喊。";
   Field fieldB = new Field("contents", fileTextB, Field.Store.YES,Field.Index.TOKENIZED);
   docB.add(fieldB);
  
   Document docC = new Document();
   String fileTextC = "我喜欢上了一个人孤独地行游,在梦与海洋的交接地带炽烈燃烧着。"+
   "因为,一条孤独的鱼喜欢上了火焰的颜色,真是荒唐地不合逻辑。";
   Field fieldC = new Field("contents", fileTextC, Field.Store.YES,Field.Index.TOKENIZED);
   docC.add(fieldC);
  
   writer.addDocument(docA);
   writer.addDocument(docB);
   writer.addDocument(docC);
   writer.optimize();
   writer.close();
}

public void search(String fieldName, String keyword) throws CorruptIndexException, IOException, ParseException {   // runs the search and highlights the results
   searcher = new IndexSearcher(indexPath);
   QueryParser queryParse = new QueryParser(fieldName, analyzer);   // the QueryParser must use the same analyzer as the indexing side
   Query query = queryParse.parse(keyword);
   Hits hits = searcher.search(query);
   for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    String text = doc.get(fieldName);
    if (text != null) {   // guard before calling text.length()
        SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
        Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(text.length()));
        TokenStream tokenStream = analyzer.tokenStream(fieldName, new StringReader(text));
        String highLightText = highlighter.getBestFragment(tokenStream, text);
        System.out.println("★ Highlighted result " + (i + 1) + ":");
        System.out.println(highLightText);
    }
   }
   searcher.close();
}


public static void main(String[] args) {    // test entry point
   MyHighLighter mhl = new MyHighLighter();
   try {
    mhl.createIndex();
    mhl.search("contents", "因为");
   } catch (CorruptIndexException e) {
    e.printStackTrace();
   } catch (IOException e) {
    e.printStackTrace();
   } catch (ParseException e) {
    e.printStackTrace();
   }
}

}

Program notes:

1. The createIndex() method builds an index over the given texts with the ThesaurusAnalyzer. Each Document carries one Field named contents. In a real application you would also add a Field named path holding the location of the matched file (a local path, or a link on the network).

2. Searching runs against the index built above. The user's query is first parsed with a QueryParser, which must use the same analyzer as the indexing side; otherwise the parsed Query (built from terms) cannot be expected to retrieve a sensible result set.

3. The parsed Query is executed and the hits are collected in Hits. The program iterates over them and takes each matching Document's content directly as the summary to highlight. In a real application this would be a summary-extraction step (or a database lookup that returns the summary for each matching file); the highlight format is then applied to that summary.

4. To show only the first N characters of each hit as the summary, set the fragment size in highlighter.setTextFragmenter(new SimpleFragmenter(text.length())); here the full text is used as the summary.

Running the program prints the following:

The dictionary has not been initialized yet; initializing it now.
Dictionary initialization finished in 3906 ms;
195574 words were added.
★ Highlighted result 1:
<font color='red'>因为</font>火烧云总是燃烧着消失在太阳冲下地平线的时刻,然后便是宁静的自然的天籁,没有谁会在这样的时光的镜片里伤感自语,<font color='red'>因为</font>灿烂给人以安静的舒适感。
★ Highlighted result 2:
<font color='red'>因为</font>带有以伤痕为代价的美丽风景总是让人不由地惴惴不安,紧接着袭面而来的抑或是病痛抑或是灾难,没有谁会能够安逸着恬然,<font color='red'>因为</font>模糊让人撕心裂肺地想呐喊。
★ Highlighted result 3:
我喜欢上了一个人孤独地行游,在梦与海洋的交接地带炽烈燃烧着。<font color='red'>因为</font>,一条孤独的鱼喜欢上了火焰的颜色,真是荒唐地不合逻辑。

Rendered in an HTML page, the results above show the keyword "因为" highlighted in red.

posted @ 2008-08-06 11:24 轩辕

Lucene Keyword Highlighting

package searchfileexample;

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;


public class MyHighLighterServlet extends HttpServlet {
  private static final String CONTENT_TYPE = "text/html; charset=GB18030";

  private String indexPath = "C:\\index";
  private Analyzer analyzer;
  private IndexSearcher searcher;

  //Initialize global variables
  public void init() throws ServletException {
    analyzer = new StandardAnalyzer();
  }
  public void createIndex() throws IOException {   // builds the index
       IndexWriter writer = new IndexWriter(indexPath,analyzer,true);
       Document docA = new Document();
       String fileTextA = "因为火烧云总是燃烧着消失在太阳冲下地平线的时刻,然后便是宁静的自然的天籁,没有谁会在这样的时光的镜片里伤感自语,因为灿烂给人以安静的舒适感。";
       Field fieldA = new Field("contents", fileTextA, Field.Store.YES,Field.Index.TOKENIZED);
       docA.add(fieldA);
 
       Document docB = new Document();
       String fileTextB = "因为带有以伤痕为代价的美丽风景总是让人不由地惴惴不安,紧接着袭面而来的抑或是病痛抑或是灾难,没有谁会能够安逸着恬然,因为模糊让人撕心裂肺地想呐喊。";
       Field fieldB = new Field("contents", fileTextB, Field.Store.YES,Field.Index.TOKENIZED);
       docB.add(fieldB);
 
       Document docC = new Document();
       String fileTextC = "我喜欢上了一个人孤独地行游,在梦与海洋的交接地带炽烈燃烧着。"+
       "因为,一条孤独的鱼喜欢上了火焰的颜色,真是荒唐地不合逻辑,原因。";
       Field fieldC = new Field("contents", fileTextC, Field.Store.YES,Field.Index.TOKENIZED);
       docC.add(fieldC);
 
       writer.addDocument(docA);
       writer.addDocument(docB);
       writer.addDocument(docC);
       writer.optimize();
       writer.close();
    }
 
    public void search(String fieldName, String keyword, PrintWriter out) throws CorruptIndexException, IOException, ParseException {   // runs the search and highlights the results
       searcher = new IndexSearcher(indexPath);
       QueryParser queryParse = new QueryParser(fieldName, analyzer);   // the QueryParser must use the same analyzer as the indexing side
       Query query = queryParse.parse(keyword);
       Hits hits = searcher.search(query);
       for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        String text = doc.get(fieldName);
        if (text != null) {   // guard before calling text.length()
            SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
            Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));
            highlighter.setTextFragmenter(new SimpleFragmenter(text.length()));
            TokenStream tokenStream = analyzer.tokenStream(fieldName, new StringReader(text));
            String highLightText = highlighter.getBestFragment(tokenStream, text);
            System.out.println("★ Highlighted result " + (i + 1) + ":");
            out.println(highLightText);
        }
       }
       searcher.close();
    }

  //Process the HTTP Get request
  public void service(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
    response.setContentType(CONTENT_TYPE);
    PrintWriter out = response.getWriter();
    out.println("<html>");
    out.println("<head><title>MyHighLighterServlet</title></head>");
    out.println("<body bgcolor=\"#ffffff\">");

  
     try {
      createIndex();
      search("contents", "因为",out);
     } catch (CorruptIndexException e) {
      e.printStackTrace();
     } catch (IOException e) {
      e.printStackTrace();
     } catch (ParseException e) {
      e.printStackTrace();
     }

   
   
   
    out.println("</body></html>");
  }

  //Clean up resources
  public void destroy() {
  }
}

posted @ 2008-08-06 11:22 轩辕

July 8, 2008

The meaning of _blank and _self

_blank -- open the link in a new window
_parent -- open the link in the parent frame
_self -- open the link in the current page (this is the default)
_top -- open the link in the topmost frame
_search -- also open the browser's search pane

the name of a frame -- open the link in that named frame
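A minimal usage example (the URL is illustrative):

```html
<!-- Opens the link in a new window; replace _blank with _self, _parent,
     _top, or a frame name to change where it opens. -->
<a href="http://www.example.com/" target="_blank">example</a>
```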

posted @ 2008-07-08 10:23 轩辕

June 5, 2008

prototype.js development notes

     Summary: Table of Contents 1. Programming Guide 1.1. What is Prototype? 1.2. Related articles 1.3. Utility methods 1.3.1. Using $() 1.3.2. Using $F() 1.3.3. Using $A() 1.3.4. Using $H() 1.3.5. Using $R() 1.3.6. Using Try.these()...  Read the full article

posted @ 2008-06-05 15:56 轩辕

March 19, 2008

Full-text search, version 2: TXT, Word, and Excel files handled separately

package searchfileexample;

/**
 * Reads Excel (.xls) and plain-text files line by line.
 */
import java.io.*;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFDateUtil;
import java.util.Date;
import org.apache.poi.hssf.usermodel.HSSFRow;

public class ExcelReader {
  // reader for plain-text input
  private BufferedReader reader = null;

  // file type (the file-name extension)
  private String filetype;

  // binary input stream for the file
  private InputStream is = null;

  // index of the current sheet
  private int currSheet;

  // current row position
  private int currPosition;

  // number of sheets
  private int numOfSheets;

  // HSSFWorkbook
  HSSFWorkbook workbook = null;
  // delimiter inserted between cells
  private static String EXCEL_LINE_DELIMITER = " ";

  // maximum number of columns
  private static int MAX_EXCEL_COLUMNS = 64;

  public int rows = 0;
  public int getRows() {
    return rows;
  }

  // constructor: creates an ExcelReader for the given file

  public ExcelReader(String inputfile) throws IOException, Exception {
    // reject a null or empty file name
    if (inputfile == null || inputfile.trim().equals("")) {
      throw new IOException("no input file specified");
    }
    // take the file-name extension as the file type
    this.filetype = inputfile.substring(inputfile.lastIndexOf(".") + 1);
    // start at row 0
    currPosition = 0;
    // start at sheet 0
    currSheet = 0;
    // open the input stream
    is = new FileInputStream(inputfile);
    // dispatch on the file type
    if (filetype.equalsIgnoreCase("txt")) {
      // txt: read directly through a BufferedReader
      reader = new BufferedReader(new InputStreamReader(is));
    }
    else if (filetype.equalsIgnoreCase("xls")) {
      // xls: read through an HSSFWorkbook
      workbook = new HSSFWorkbook(is);
      // record the number of sheets
      numOfSheets = workbook.getNumberOfSheets();
    }
    else {
      throw new Exception("File Type Not Supported");
    }
  }

  // readLine returns the next line of the file
  public String readLine() throws IOException {
    // txt: delegate to the BufferedReader
    if (filetype.equalsIgnoreCase("txt")) {
      String str = reader.readLine();
      // skip blank lines (the null guard avoids an NPE at end of file)
      while (str != null && str.trim().equals("")) {
        str = reader.readLine();
      }
      return str;
    }
    // xls: read through the POI API
    else if (filetype.equalsIgnoreCase("xls")) {
      // get the sheet at currSheet
      HSSFSheet sheet = workbook.getSheetAt(currSheet);
      rows = sheet.getLastRowNum();
      // check whether we have passed the end of the current sheet
      if (currPosition > sheet.getLastRowNum()) {
        // reset the row position
        currPosition = 0;
        // are there more sheets?
        while (currSheet != numOfSheets - 1) {
          // look at the next sheet
          sheet = workbook.getSheetAt(currSheet + 1);
          // is that sheet already exhausted?
          if (currPosition == sheet.getLastRowNum()) {
            // move on to the next sheet
            currSheet++;
            continue;
          }
          else {
            // current row index
            int row = currPosition;
            currPosition++;
            // read the current row
            return getLine(sheet, row);
          }
        }
        return null;
      }
      // current row index
      int row = currPosition;
      currPosition++;
      // read the current row
      return getLine(sheet, row);
    }
    return null;
  }

  // getLine returns one row of a sheet as a string
  private String getLine(HSSFSheet sheet, int row) {
    // fetch the row
    HSSFRow rowline = sheet.getRow(row);
    // buffer for the row's text
    StringBuffer buffer = new StringBuffer();
    // number of populated columns in this row
    int filledColumns = rowline.getLastCellNum();
    HSSFCell cell = null;
    // iterate over the columns
    for (int i = 0; i < filledColumns; i++) {
      // fetch the current cell
      cell = rowline.getCell( (short) i);
      String cellvalue = null;
      if (cell != null) {
        // dispatch on the cell type
        switch (cell.getCellType()) {
          // numeric cell
          case HSSFCell.CELL_TYPE_NUMERIC: {
            // is the cell a date?
            if (HSSFDateUtil.isCellDateFormatted(cell)) {
              // date: convert to a locale-formatted string
              Date date = cell.getDateCellValue();
              cellvalue = date.toLocaleString();
            }
            // plain number
            else {
              // truncate the numeric value to an int and stringify it
              Integer num = new Integer( (int) cell
                                        .getNumericCellValue());
              cellvalue = String.valueOf(num);
            }
            break;
          }
          // string cell
          case HSSFCell.CELL_TYPE_STRING:

            // take the string value, escaping single quotes
            cellvalue = cell.getStringCellValue().replaceAll("'", "''");
            break;
            // any other cell type
          default:
            cellvalue = " ";
        }
      }
      else {
        cellvalue = "";
      }
      // append the delimiter after each cell
      buffer.append(cellvalue).append(EXCEL_LINE_DELIMITER);
    }
    // return the row as a string
    return buffer.toString();
  }

  // close releases the underlying streams
  public void close() {
    // close the InputStream if it is open
    if (is != null) {
      try {
        is.close();
      }
      catch (IOException e) {
        is = null;
      }
    }
    // close the BufferedReader if it is open
    if (reader != null) {
      try {
        reader.close();
      }
      catch (IOException e) {
        reader = null;
      }
    }
  }

  public static void main(String[] args) {
    try {
      ExcelReader er = new ExcelReader("d:\\xp.xls");
      String line = er.readLine();
      while (line != null) {
        System.out.println(line);
        line = er.readLine();
      }
      er.close();
    }
    catch (Exception e) {
      e.printStackTrace();
    }
  }

}

package searchfileexample;

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;
import org.apache.lucene.demo.FileDocument;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import java.io.FileReader;
import org.apache.lucene.index.*;
import java.text.DateFormat;
import org.apache.poi.hdf.extractor.WordDocument;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.PrintWriter;
import java.io.FileInputStream;
import java.io.*;
import org.textmining.text.extraction.WordExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

/**
 * Builds an index for every file under a directory.
 * <p>Title: </p>
 * <p>Description: </p>
 * <p>Copyright: Copyright (c) 2007</p>
 * <p>Company: </p>
 * @author not attributable
 * @version 1.0
 * Index files can be written to different directories depending on the file type, so that index information is stored by category.
 */

public class IndexFilesServlet
    extends HttpServlet {
  static final File INDEX_DIR = new File("index");

  //Initialize global variables
  public void init() throws ServletException {
  }

  //Process the HTTP Get request
  public void service(HttpServletRequest request, HttpServletResponse response) throws
      ServletException, IOException {
    final File docDir = new File("a"); // the directory containing the files to index
    if (!docDir.exists() || !docDir.canRead()) {
      System.out.println("Document directory '" + docDir.getAbsolutePath() +
                         "' does not exist or is not readable, please check the path");
      System.exit(1);
    }

    Date start = new Date();
    try {
      IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true); // true - overwrite any existing index; false - keep it
      System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
      indexDocs(writer, docDir);
      System.out.println("Optimizing...");
      writer.optimize();
      writer.close();

      Date end = new Date();
      System.out.println(end.getTime() - start.getTime() +
                         " total milliseconds");

    }
    catch (IOException e) {
      System.out.println(" caught a " + e.getClass() +
                         "\n with message: " + e.getMessage());
    }

  }

  //Clean up resources
  public void destroy() {
  }

  public void indexDocs(IndexWriter writer, File file) throws IOException {
    // do not try to index files that cannot be read
    int index = 0;
    String filehouzui = "";
    index = file.getName().indexOf(".");
    //strFileName = strFileName.substring(0, index) +DateUtil.getCurrDateTime() + "." + strFileName.substring(index + 1);
    filehouzui = file.getName().substring(index + 1);

    if (file.canRead()) {
      if (file.isDirectory()) {
        String[] files = file.list();
        // an IO error could occur
        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      }
      else {
        System.out.println("adding " + file);
        try {
          if (filehouzui.equals("doc")) {
            writer.addDocument(getWordDocument(file, new FileInputStream(file)));
          }
          else if (filehouzui.equals("txt")) {
            writer.addDocument(getTxtDocument(file, new FileInputStream(file)));
          }
          else if (filehouzui.equals("xls")) {
            writer.addDocument(getExcelDocument(file, new FileInputStream(file)));
          }
          //writer.addDocument(parseFile(file));

          //writer.addDocument(FileDocument.Document(file)); // "path" holds the file's relative path
        }
        // at least on windows, some temporary files raise this exception with an "access denied" message
        // checking if the file can be read doesn't help
        catch (Exception fnfe) {
          ;
        }
      }
    }
  }

  /**
   * @param file the file to convert
   *
   * Turns a File into a Document.
   */
  public Document parseFile(File file) throws Exception {
    Document doc = new Document();
    doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                      Field.Index.UN_TOKENIZED)); // the file's absolute path
    try {
      doc.add(new Field("contents", new FileReader(file))); // index the file contents
      doc.add(new Field("title", file.getName(), Field.Store.YES,
                        Field.Index.UN_TOKENIZED));
      // index the last-modified time
      doc.add(new Field("modified",
                        String.valueOf(DateFormat.
                                       getDateTimeInstance().format(new
          Date(file.lastModified()))), Field.Store.YES,
                        Field.Index.UN_TOKENIZED));
      //doc.removeField("title");
    }
    catch (Exception e) {
      e.printStackTrace();
    }
    return doc;
  }

 
  /**
   * @param file the Word file to convert
   *
   * Reads a Word document with POI.
   * Not very reliable: it does not extract the full text of the document.
   */
  public Document getDocument(File file, FileInputStream is) throws Exception {
    String bodyText = null;
    try {
      WordDocument wd = new WordDocument(is);
      StringWriter docTextWriter = new StringWriter();
      wd.writeAllText(new PrintWriter(docTextWriter));
      bodyText = docTextWriter.toString();
      docTextWriter.close();
      //   bodyText   =   new   WordExtractor().extractText(is);
      System.out.println("word content====" + bodyText);
    }
    catch (Exception e) {
      ;
    }
    if ( (bodyText != null)) {
      Document doc = new Document();
      doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                        Field.Index.UN_TOKENIZED)); // the file's absolute path
      doc.add(new Field("contents", bodyText, Field.Store.YES,
                        Field.Index.TOKENIZED));
      return doc;
    }
    return null;
  }

  //Document   doc   =   getDocument(new   FileInputStream(new   File(file)));
  /**
   * @param file the Word file to convert
   *
   * Reads a Word document with tm-extractors-0.4.jar.
   * Works well.
   */
  public Document getWordDocument(File file, FileInputStream is) throws
      Exception {
    String bodyText = null;
    try {
      WordExtractor extractor = new WordExtractor();
      System.out.println("Word document");
      bodyText = extractor.extractText(is);
      if ( (bodyText != null)) {
        Document doc = new Document();
        doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                          Field.Index.UN_TOKENIZED)); // the file's absolute path
        doc.add(new Field("contents", bodyText, Field.Store.YES,
                          Field.Index.TOKENIZED));
        System.out.println("word content====" + bodyText);
        return doc;
      }
    }
    catch (Exception e) {
      ;
    }
    return null;
  }

  /**
   * @param file the text file to convert
   *
   * Reads a plain-text (TXT) document.
   */
  public Document getTxtDocument(File file, FileInputStream is) throws
      Exception {
    try {
      Reader textReader = new FileReader(file);
      Document doc = new Document();
      doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                        Field.Index.UN_TOKENIZED)); // the file's absolute path
      doc.add(new Field("contents", textReader));
      return doc;
    }
    catch (Exception e) {
      ;
    }
    return null;
  }

  /**
   * Reads an Excel file via POI (using the ExcelReader above)
   * @param file File
   * @param is FileInputStream
   * @throws Exception
   * @return Document
   */
  public Document getExcelDocument(File file, FileInputStream is) throws
      Exception {
    String bodyText = "";
    try {
      System.out.println("reading Excel file");
      ExcelReader er = new ExcelReader(file.getAbsolutePath());
      bodyText = er.readLine();
      int rows = 0;
      rows = er.getRows();
      for (int i = 0; i < rows; i++) {
        bodyText = bodyText + er.readLine();
        System.out.println("bodyText===" + bodyText);
      }
      Document doc = new Document();
      doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                        Field.Index.UN_TOKENIZED)); // the file's absolute path
      doc.add(new Field("contents", bodyText, Field.Store.YES,
                        Field.Index.TOKENIZED));
      System.out.println("word content====" + bodyText);
      return doc;
    }
    catch (Exception e) {
      System.out.println(e);
    }
    return null;
  }
}


 

package searchfileexample;

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;
import org.apache.lucene.queryParser.*;

public class SearchFileServlet
    extends HttpServlet {
  private static final String CONTENT_TYPE = "text/html; charset=GBK";

  //Initialize global variables
  public void init() throws ServletException {
  }

  /** Use the norms from one field for all fields.  Norms are read into memory,
   * using a byte of memory per document per searched field.  This can cause
   * search of large collections with a large number of fields to run out of
   * memory.  If all of the fields contain only a single token, then the norms
   * are all identical, then single norm vector may be shared. */
  private static class OneNormsReader
      extends FilterIndexReader {
    private String field;

    public OneNormsReader(IndexReader in, String field) {
      super(in);
      this.field = field;
    }

    public byte[] norms(String field) throws IOException {
      return in.norms(this.field);
    }
  }

  //Process the HTTP Get request
  public void service(HttpServletRequest request, HttpServletResponse response) throws
      ServletException, IOException {
    response.setContentType(CONTENT_TYPE);
    PrintWriter out = response.getWriter();

    String[] args = {
        "a", "b"};
    String usage =
        "Usage: java org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-raw] [-norms field]";
    if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
      System.out.println(usage);
      System.exit(0);
    }

    String index = "index"; // the directory holding the generated index files; do not change it
    String field = "contents"; // the value of field must not be modified
    String queries = null; // a file holding the search keywords
    queries = "D:/lfy_programe/全文检索/SearchFileExample/aa.txt";
    System.out.println("-----------------------" + request.getContextPath());
    int repeat = 1;
    boolean raw = false;
    String normsField = null;

    for (int i = 0; i < args.length; i++) {
      if ("-index".equals(args[i])) {
        index = args[i + 1];
        i++;
      }
      else if ("-field".equals(args[i])) {
        field = args[i + 1];
        i++;
      }
      else if ("-queries".equals(args[i])) {
        queries = args[i + 1];
        i++;
      }
      else if ("-repeat".equals(args[i])) {
        repeat = Integer.parseInt(args[i + 1]);
        i++;
      }
      else if ("-raw".equals(args[i])) {
        raw = true;
      }
      else if ("-norms".equals(args[i])) {
        normsField = args[i + 1];
        i++;
      }
    }

    IndexReader reader = IndexReader.open(index);

    if (normsField != null) {
      reader = new OneNormsReader(reader, normsField);

    }
    Searcher searcher = new IndexSearcher(reader); // opens the index
    Analyzer analyzer = new StandardAnalyzer(); // the analyzer (should match the one used at indexing time)
    //Analyzer analyzer = new StandardAnalyzer();

    BufferedReader in = null;
    if (queries != null) {
      in = new BufferedReader(new FileReader(queries));
    }
    else {
      in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
    }
    QueryParser parser = new QueryParser(field, analyzer);

    out.println("<html>");
    out.println("<head><title>SearchFileServlet</title></head>");
    out.println("<body bgcolor=\"#ffffff\">");

    while (true) {
      if (queries == null) { // prompt the user
        System.out.println("Enter query: ");

      }
      String line = in.readLine(); // the query string
      System.out.println("query string===" + line);

      if (line == null || line.length() == -1) {
        break;
      }

      line = line.trim();
      if (line.length() == 0) {
        break;
      }

      Query query = null;
      try {
        query = parser.parse(line);
      }
      catch (ParseException ex) {
        continue; // skip unparsable queries instead of dereferencing a null Query below
      }
      System.out.println("Searching for: " + query.toString(field)); // the parsed query

      Hits hits = searcher.search(query);

      if (repeat > 0) { // repeat & time as benchmark
        Date start = new Date();
        for (int i = 0; i < repeat; i++) {
          hits = searcher.search(query);
        }
        Date end = new Date();
        System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
      }
      out.println("<p>Found " + hits.length() + " documents containing [" +
                  query.toString(field) + "]</p>");

      System.out.println("Found " + hits.length() + " documents containing [" +
                         query.toString(field) + "]");

      final int HITS_PER_PAGE = 10; // maximum number of hits shown per page
      int currentNum = 5; // current record count (not used below)

      for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
        //start = start + currentNum;
        int end = Math.min(hits.length(), start + HITS_PER_PAGE);

        for (int i = start; i < end; i++) {

          //if (raw) {                              // output raw format
          System.out.println("doc=" + hits.id(i) + " score=" + hits.score(i)); // score is the relevance of the hit
          //continue;
          //}

          Document doc = hits.doc(i);
          String path = doc.get("path");

          if (path != null) {
            System.out.println( (i + 1) + ". " + path);
            out.println("<p>" + (i + 1) + ". " + path + "</p>");
            String title = doc.get("title");
            System.out.println("   modified: " + doc.get("modified"));
            if (title != null) {
              System.out.println("   Title: " + doc.get("title"));
            }
          }
          else {
            System.out.println( (i + 1) + ". " + "No path for this document");
          }
        }

        if (queries != null) { // non-interactive
          break;
        }

        if (hits.length() > end) {
          System.out.println("more (y/n) ? ");
          line = in.readLine();
          if (line.length() == 0 || line.charAt(0) == 'n') {
            break;
          }
        }
      }
    }
    reader.close();

    out.println("</body></html>");
  }

//Clean up resources
  public void destroy() {
  }
}


 

 

posted @ 2008-03-19 16:52 轩辕

March 18, 2008

Full-text search

package searchfileexample;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;
import org.apache.lucene.demo.FileDocument;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import java.io.FileReader;
import org.apache.lucene.index.*;
import java.text.DateFormat;
import org.apache.poi.hdf.extractor.WordDocument;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.PrintWriter;
import java.io.FileInputStream;
import java.io.*;
import org.textmining.text.extraction.WordExtractor;

/**
 * Builds an index for every file under a directory.
 * <p>Title: </p>
 * <p>Description: </p>
 * <p>Copyright: Copyright (c) 2007</p>
 * <p>Company: </p>
 * @author not attributable
 * @version 1.0
 * Index files can be written to different directories depending on the file type, so that index information is stored by category.
 */
/** Index all text files under a directory. */
public class IndexFiles {

  private IndexFiles() {}

  static final File INDEX_DIR = new File("index");

  /** Index all text files under a directory. */
  public static void main(String[] args) {
    String usage = "java org.apache.lucene.demo.IndexFiles <root_directory>";
    //String[] arg = {"a","b"};
    //System.out.println(arg[0]);
    /*
         if (args.length == 0) {
      System.err.println("Usage: " + usage);
      System.exit(1);
         }*/
    /*
        if (INDEX_DIR.exists()) {
          System.out.println("Cannot save index to '" +INDEX_DIR+ "' directory, please delete it first");
          System.exit(1);
        }*/

    final File docDir = new File("a"); // directory holding the files to index
    if (!docDir.exists() || !docDir.canRead()) {
      System.out.println("Document directory '" + docDir.getAbsolutePath() +
                         "' does not exist or is not readable, please check the path");
      System.exit(1);
    }

    Date start = new Date();
    try {
      IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true); // true = create/overwrite the index, false = append to an existing one
      System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
      indexDocs(writer, docDir);
      System.out.println("Optimizing...");
      writer.optimize();
      writer.close();

      Date end = new Date();
      System.out.println(end.getTime() - start.getTime() +
                         " total milliseconds");

    }
    catch (IOException e) {
      System.out.println(" caught a " + e.getClass() +
                         "\n with message: " + e.getMessage());
    }
  }

  static void indexDocs(IndexWriter writer, File file) throws IOException {
    // do not try to index files that cannot be read
    if (file.canRead()) {
      if (file.isDirectory()) {
        String[] files = file.list();
        // an IO error could occur
        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      }
      else {
        System.out.println("adding " + file);
        try {

          writer.addDocument(getDocument2(file, new FileInputStream(file)));
          //writer.addDocument(parseFile(file));

          //writer.addDocument(FileDocument.Document(file));//path 存放文件的相对路径
        }
        // at least on windows, some temporary files raise this exception with an "access denied" message
        // checking if the file can be read doesn't help
        catch (Exception fnfe) {
          // skip this file
        }
      }
    }
  }

  /**
   * Convert a File into a Document.
   *
   * @param file the file to index
   */
  static Document parseFile(File file) throws Exception {
    Document doc = new Document();
    doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                      Field.Index.UN_TOKENIZED)); // store the file's absolute path
    try {
      doc.add(new Field("contents", new FileReader(file))); // index the file contents
      doc.add(new Field("title", file.getName(), Field.Store.YES,
                        Field.Index.UN_TOKENIZED));
      // index the last-modified time
      doc.add(new Field("modified",
                        String.valueOf(DateFormat.getDateTimeInstance()
                            .format(new Date(file.lastModified()))),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
      //doc.removeField("title");
    }
    catch (Exception e) {
      e.printStackTrace();
    }
    return doc;
  }

  /**
   * Convert a Word document (disabled draft; it references an undefined
   * 'is' and 'bodyText' and never assigns 're', so it is left commented out):

         static String changeWord(File file) throws Exception {
    String re = "";
    try {
      WordDocument wd = new WordDocument(is);
        StringWriter docTextWriter = new StringWriter();
        wd.writeAllText(new PrintWriter(docTextWriter));
        docTextWriter.close();
        bodyText = docTextWriter.toString();

    } catch (Exception e) {
        e.printStackTrace();
    }
    return re;
         }*/
  /**
   * Read a Word document using POI.
   *
   * @param file the Word file to read
   */
  static Document getDocument(File file, FileInputStream is) throws Exception {

    String bodyText = null;

    try {

      //BufferedReader wt = new BufferedReader(new InputStreamReader(is));
      //bodyText = wt.readLine();
      //System.out.println("word ===="+bodyText);

      WordDocument wd = new WordDocument(is);
      StringWriter docTextWriter = new StringWriter();
      wd.writeAllText(new PrintWriter(docTextWriter));
      bodyText = docTextWriter.toString();
      docTextWriter.close();
      //   bodyText   =   new   WordExtractor().extractText(is);
      System.out.println("word content====" + bodyText);
    }
    catch (Exception e) {
      // ignore documents that cannot be parsed as Word files
    }

    if ( (bodyText != null)) {
      Document doc = new Document();
      doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                        Field.Index.UN_TOKENIZED)); // store the file's absolute path
      doc.add(new Field("contents", bodyText, Field.Store.YES,
                        Field.Index.TOKENIZED));

      return doc;
    }
    return null;
  }

  // Document doc = getDocument(new FileInputStream(new File(file)));
  /**
   * Read a Word document using tm-extractors-0.4.jar.
   *
   * @param file the Word file to read
   */
  static Document getDocument2(File file, FileInputStream is) throws Exception {

    String bodyText = null;

    try {

      //FileInputStream in = new FileInputStream("D:/lfy_programe/全文检索/SearchFileExample/a/aa.doc");
      //  FileInputStream in = new FileInputStream ("D:/szqxjzhbase/技术测试/新建 Microsoft Word 文档.doc");
      WordExtractor extractor = new WordExtractor();
      System.out.println(is.available());

      bodyText = extractor.extractText(is);

//    System.out.println("the result length is"+str.length());
      System.out.println("word content===="+bodyText);

    }
    catch (Exception e) {
      // ignore documents that cannot be parsed as Word files
    }

    if ( (bodyText != null)) {
      Document doc = new Document();
      doc.add(new Field("path", file.getAbsolutePath(), Field.Store.YES,
                        Field.Index.UN_TOKENIZED)); // store the file's absolute path
      doc.add(new Field("contents", bodyText, Field.Store.YES,
                        Field.Index.TOKENIZED));

      return doc;
    }
    return null;
  }

}
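The recursive traversal in `indexDocs` (descend into directories, skip unreadable entries, hand regular files to the writer) can be sketched without any Lucene dependency. A minimal stand-alone version of the same walk, assuming a hypothetical `DirWalk` helper:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class DirWalk {
  // Collect all regular files under 'file', mirroring indexDocs' recursion:
  // unreadable entries are skipped, directories are descended into.
  static void collect(File file, List<File> out) {
    if (!file.canRead()) return;
    if (file.isDirectory()) {
      String[] children = file.list();
      if (children != null) {           // list() may return null on an I/O error
        for (String name : children) {
          collect(new File(file, name), out);
        }
      }
    } else {
      out.add(file);
    }
  }

  public static void main(String[] args) throws IOException {
    File dir = Files.createTempDirectory("walk").toFile();
    new File(dir, "sub").mkdir();
    Files.createFile(new File(dir, "a.txt").toPath());
    Files.createFile(new File(dir, "sub/b.txt").toPath());
    List<File> found = new ArrayList<>();
    collect(dir, found);
    System.out.println(found.size()); // prints 2
  }
}
```

In `IndexFiles` the `out.add(file)` slot is where `writer.addDocument(...)` sits; everything else is identical traversal logic.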

package searchfileexample;


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Fieldable;

/** Simple command-line based search demo. */
public class SearchFiles {

  /** Use the norms from one field for all fields.  Norms are read into memory,
   * using a byte of memory per document per searched field.  This can cause
   * search of large collections with a large number of fields to run out of
   * memory.  If all of the fields contain only a single token, then the norms
   * are all identical, and a single norm vector may be shared. */
  private static class OneNormsReader extends FilterIndexReader {
    private String field;

    public OneNormsReader(IndexReader in, String field) {
      super(in);
      this.field = field;
    }

    public byte[] norms(String field) throws IOException {
      return in.norms(this.field);
    }
  }

  private SearchFiles() {}

  /** Simple command-line based search demo. */
  public static void main(String[] arg) throws Exception {
    String[] args = {"a","b"}; // hard-coded in place of the real command-line arguments
    String usage =
      "Usage: java org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-raw] [-norms field]";
    if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
      System.out.println(usage);
      System.exit(0);
    }

    String index = "index"; // directory holding the generated index files; do not change
    String field = "contents"; // the field name must not be changed
    String queries = null; // a file holding the keywords to search for
    queries = "D:/lfy_programe/全文检索/SearchFileExample/aa.txt";

    int repeat = 1;
    boolean raw = false;
    String normsField = null;

    for (int i = 0; i < args.length; i++) {
      if ("-index".equals(args[i])) {
        index = args[i+1];
        i++;
      } else if ("-field".equals(args[i])) {
        field = args[i+1];
        i++;
      } else if ("-queries".equals(args[i])) {
        queries = args[i+1];
        i++;
      } else if ("-repeat".equals(args[i])) {
        repeat = Integer.parseInt(args[i+1]);
        i++;
      } else if ("-raw".equals(args[i])) {
        raw = true;
      } else if ("-norms".equals(args[i])) {
        normsField = args[i+1];
        i++;
      }
    }

    IndexReader reader = IndexReader.open(index);

    if (normsField != null)
      reader = new OneNormsReader(reader, normsField);

    Searcher searcher = new IndexSearcher(reader); // opens the index files
    Analyzer analyzer = new StandardAnalyzer(); // analyzer used for query parsing

    BufferedReader in = null;
    if (queries != null) {
      in = new BufferedReader(new FileReader(queries));
    } else {
      in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
    }
    QueryParser parser = new QueryParser(field, analyzer);
    while (true) {
      if (queries == null)                        // prompt the user
        System.out.println("Enter query: ");

      String line = in.readLine(); // the query string
      System.out.println("query string === " + line);

      if (line == null) // readLine() returns null at EOF; length() is never -1
        break;

      line = line.trim();
      if (line.length() == 0)
        break;

      Query query = parser.parse(line);
      System.out.println("Searching for: " + query.toString(field)); // the parsed query

      Hits hits = searcher.search(query);

      if (repeat > 0) {                           // repeat & time as benchmark
        Date start = new Date();
        for (int i = 0; i < repeat; i++) {
          hits = searcher.search(query);
        }
        Date end = new Date();
        System.out.println("Time: "+(end.getTime()-start.getTime())+"ms");
      }

      System.out.println("Found " + hits.length() + " document(s) containing [" + query.toString(field) + "]");

      final int HITS_PER_PAGE = 10; // maximum number of hits shown per page
      int currentNum = 2; // current record count (unused)
      for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
        //start = start + currentNum;
        int end = Math.min(hits.length(), start + HITS_PER_PAGE);
        for (int i = start; i < end; i++) {

          //if (raw) {                              // output raw format
            System.out.println("doc=" + hits.id(i) + " score=" + hits.score(i)); // score = relevance to the query
            //continue;
          //}

          Document doc = hits.doc(i);
          String path = doc.get("path");


          if (path != null) {
            System.out.println((i+1) + ". " + path);
            String title = doc.get("title");
            System.out.println("   modified: " + doc.get("modified"));
            if (title != null) {
              System.out.println("   Title: " + doc.get("title"));
            }
          } else {
            System.out.println((i+1) + ". " + "No path for this document");
          }
        }

        if (queries != null)                      // non-interactive
          break;

        if (hits.length() > end) {
          System.out.println("more (y/n) ? ");
          line = in.readLine();
          if (line.length() == 0 || line.charAt(0) == 'n')
            break;
        }
      }
    }
    reader.close();
  }
}
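The paging arithmetic in the loop above (stepping `start` by `HITS_PER_PAGE` and clipping `end` with `Math.min`) is easy to get wrong at the last partial page, so it is worth isolating. A minimal sketch of the same window computation (`Paging` is a hypothetical name, not part of the demo):

```java
public class Paging {
  // Return the [start, end) window for page 'page' over 'total' hits,
  // 'pageSize' hits per page -- the same arithmetic as the start/end
  // variables in the SearchFiles result loop.
  static int[] window(int page, int pageSize, int total) {
    int start = page * pageSize;
    int end = Math.min(total, start + pageSize);
    return new int[] { start, end };
  }

  public static void main(String[] args) {
    // 23 hits, 10 per page: windows are [0,10), [10,20), [20,23)
    int[] last = window(2, 10, 23);
    System.out.println(last[0] + ".." + last[1]); // prints 20..23
  }
}
```

The `Math.min` clip is what keeps the final page from indexing past `hits.length()`.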

package searchfileexample;

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
import java.util.*;
import org.textmining.text.extraction.WordExtractor;

public class ReadWord extends HttpServlet {
  private static final String CONTENT_TYPE = "text/html; charset=GBK";

  // Initialize global variables
  public void init() throws ServletException {
  }

  // Process the HTTP Get request
  public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
    response.setContentType(CONTENT_TYPE);
    FileInputStream in = new FileInputStream("D:/lfy_programe/全文检索/SearchFileExample/a/aa.doc");
    // FileInputStream in = new FileInputStream("D:/szqxjzhbase/技术测试/新建 Microsoft Word 文档.doc");
    WordExtractor extractor = new WordExtractor();
    System.out.println(in.available());
    String str = null;
    try {
      str = extractor.extractText(in);
    }
    catch (Exception ex) {
      // extraction failed; str stays null
    }
    // System.out.println("the result length is" + str.length());
    System.out.println(str);
  }

  // Clean up resources
  public void destroy() {
  }
}

1. Wildcard (fuzzy) queries for English terms
Append the wildcard "*" to the keyword at query time.

2. IndexFiles.java
The Java class that builds the index.

3. SearchFiles.java
The Java class that runs searches.

4. ReadWord.java
Reads Word files using tm-extractors-0.4.jar.
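As a plain-Java illustration of what the trailing "*" does, here is a prefix-match sketch over a list of terms (`WildcardDemo` is a hypothetical helper; in the real demo the wildcard is handled by Lucene's QueryParser, not by this code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class WildcardDemo {
  // Simulate a trailing-* wildcard query as a prefix match over indexed terms;
  // without the "*", only an exact term match succeeds.
  static List<String> match(String pattern, List<String> terms) {
    if (pattern.endsWith("*")) {
      String prefix = pattern.substring(0, pattern.length() - 1);
      return terms.stream().filter(t -> t.startsWith(prefix)).collect(Collectors.toList());
    }
    return terms.stream().filter(t -> t.equals(pattern)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList("search", "searching", "searcher", "select");
    System.out.println(match("searc*", terms)); // prints [search, searching, searcher]
    System.out.println(match("searc", terms));  // prints []
  }
}
```

This is why the bare keyword "searc" finds nothing while "searc*" matches every term sharing the prefix.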

posted @ 2008-03-18 10:35 轩辕 阅读(253) | 评论 (0)编辑 收藏

Reading Word files with tm-extractors-0.4.jar

package searchfileexample;

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
import java.util.*;
import org.textmining.text.extraction.WordExtractor;

public class ReadWord extends HttpServlet {
  private static final String CONTENT_TYPE = "text/html; charset=GBK";

  // Initialize global variables
  public void init() throws ServletException {
  }

  // Process the HTTP Get request
  public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
    response.setContentType(CONTENT_TYPE);
    FileInputStream in = new FileInputStream("D:/lfy_programe/全文检索/SearchFileExample/a/aa.doc");
    // FileInputStream in = new FileInputStream("D:/szqxjzhbase/技术测试/新建 Microsoft Word 文档.doc");
    WordExtractor extractor = new WordExtractor();
    System.out.println(in.available());
    String str = null;
    try {
      str = extractor.extractText(in);
    }
    catch (Exception ex) {
      // extraction failed; str stays null
    }
    // System.out.println("the result length is" + str.length());
    System.out.println(str);
  }

  // Clean up resources
  public void destroy() {
  }
}

posted @ 2008-03-18 10:33 轩辕 阅读(5490) | 评论 (5)编辑 收藏

August 7, 2007

AJAX Upload with Upload-Progress Monitoring

     Abstract: AJAX Upload with upload-progress monitoring, posted by cleverpig on 2007-01-08 11:12:14. Source: Matrix. Comments: 83; hits: 5,066; vote score: 12 from 4 voters. Keywords: AJAX, upload, monitor ...  Read full article

posted @ 2007-08-07 17:02 轩辕 阅读(2482) | 评论 (1)编辑 收藏

AJAX file upload

http://www.matrix.org.cn/resource/article/2007-01-08/09db6d69-9ec6-11db-ab77-2bbe780ebfbf.html

posted @ 2007-08-07 16:54 轩辕 阅读(217) | 评论 (0)编辑 收藏

August 1, 2007

Java program for file downloads

/*
 * Created on 2006-1-11
 *
 * To change the template for generated files go to
 * Window > Preferences > Java > Code Generation > Code and Comments
 */

package com.abc.cc.util.file;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import javax.servlet.ServletOutputStream;
import java.io.FileInputStream;

import com.abc.callcenter.DataStatistic.Export.CreatUDStatisticExport;
import com.abc.callcenter.uds.unitedealwith.UniteUtil;

/**
 * Created on 2006-2-9
 * Purpose: Workbench > Document Management > File Download
 *
 * @author asx
 */
public class Down extends HttpServlet {

    public void doGet(HttpServletRequest request, HttpServletResponse response) {
        System.out.println("logining Down");
        response.setContentType("text/html; charset=GBK");
        String downfile = request.getRealPath("/") + "/exportfile/"
                + TimeTool.getCurrentDateForEight() + "_"
                + StringTool.getExportFileName(Integer.parseInt(request.getParameter("fileName")));
        try {
            downfile = new String(downfile.getBytes("GBK"));
        } catch (Exception e) {
            // keep the original path if the GBK conversion fails
        }
        System.out.println("downfile = " + downfile);
        String fileName = buildFilename(downfile);
        System.out.println("fileName = " + fileName);

        String strBeginDate = request.getParameter("excel_begindate"); // start date
        String strEndDate = request.getParameter("excel_enddate"); // end date
        String strUnite_dept = request.getParameter("excel_department_name"); // department
        try {
            strUnite_dept = UniteUtil.Query_NameDepartment("" + strUnite_dept);
        } catch (Exception e) {
            e.printStackTrace();
        }

        CreatUDStatisticExport cue = new CreatUDStatisticExport();
        cue.queryPrintInfo(strBeginDate, strEndDate, strUnite_dept, request);

        System.out.println("logining Down1");
        try {
            fileName = response.encodeURL(new String(fileName.getBytes(), "iso-8859-1"));
            response.reset();
            response.setContentType("APPLICATION/OCTET-STREAM");
            response.setHeader("Content-Disposition", "attachment; filename=\"" + fileName + "\"");
            ServletOutputStream out = response.getOutputStream();
            FileInputStream inStream = new FileInputStream(downfile);

            // read the file in chunks and write each chunk to the response
            byte[] b = new byte[1024];
            int len;
            while ((len = inStream.read(b, 0, b.length)) > 0) {
                out.write(b, 0, len);
            }
            out.close();
            inStream.close();
        } catch (Exception e) {
            // client aborted or I/O error; nothing more to do
        }
    }

    public void doPost(HttpServletRequest request, HttpServletResponse response) {
        doGet(request, response);
    }

    /**
     * Strip the directory part from the file's path.
     *
     * @param sou the source path
     * @return the bare file name
     */
    private static String buildFilename(String sou) {
        while (sou.indexOf("/") > -1) {
            sou = sou.substring(sou.indexOf("/") + 1);
        }
        return sou;
    }
}
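The chunked copy at the heart of `doGet` is a standard stream-to-stream pattern. A self-contained sketch using in-memory streams so it can run without a file or a servlet container (`StreamCopy` is a hypothetical helper name):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
  // Copy everything from 'in' to 'out' in fixed-size chunks and return the
  // byte count -- the same loop Down.doGet uses to stream the file to the
  // servlet response.
  static long copy(InputStream in, OutputStream out) throws IOException {
    byte[] b = new byte[1024];
    long total = 0;
    int len;
    while ((len = in.read(b, 0, b.length)) > 0) { // read() returns -1 at EOF
      out.write(b, 0, len);
      total += len;
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = new byte[3000]; // three full chunks minus padding: 1024+1024+952
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    long n = copy(new ByteArrayInputStream(data), sink);
    System.out.println(n); // prints 3000
  }
}
```

The fixed 1 KB buffer keeps memory use constant regardless of file size, which is why the servlet streams the download rather than loading the file whole.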

posted @ 2007-08-01 15:23 轩辕 阅读(234) | 评论 (0)编辑 收藏