Wuwei (无为)

Through non-action, anything can be done; through non-action, one reaches the greatest depth!

A Survey of Web Information Extraction Techniques, Part 2

Chapter 6: Summary and Discussion

Section 6.1: Summary

Section 6.2: Discussion

Section 6.1: Summary

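Information extraction (IE) is a field that has emerged over the past decade. International conferences such as MUC have paid it close attention, proposed methods for evaluating IE systems, and defined a set of evaluation metrics.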
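The text above does not name the metrics. Precision, recall, and their harmonic mean (the F-measure) are the ones commonly associated with MUC-style evaluation, so the sketch below assumes those three. The function name `score_extraction` and the sample data are made up for illustration.

```python
def score_extraction(predicted, gold):
    """Return (precision, recall, f1) for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # items the system got exactly right
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold-standard answers and system output as (field, value) pairs.
gold = {
    ("name", "Alice"), ("name", "Bob"),
    ("phone", "555-0100"), ("phone", "555-0101"),
}
predicted = {
    ("name", "Alice"), ("phone", "555-0100"), ("phone", "555-9999"),
}

precision, recall, f1 = score_extraction(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predicted items are correct and two of the four gold items are found, so precision is 2/3 and recall is 1/2.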

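Information extraction research covers structured, semi-structured, and free-text documents. For free-text documents, most approaches rely on natural language processing, whereas for the other two kinds most approaches are delimiter-based.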

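Web pages are one of the main targets of information extraction research. A wrapper is typically used to extract information from one specific website. With a set of wrappers that handle different websites, the data can be represented uniformly and the relationships among the data can be obtained.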
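As a rough illustration of that idea, the sketch below gives each site its own delimiter-based wrapper and merges the output into one record format. The two page snippets, the delimiter pairs, and the names `extract_between`, `run_wrapper`, and `WRAPPERS` are all invented for this example. It also assumes every record on a page contains every field, which real pages often violate.

```python
def extract_between(page, left, right):
    """Return every substring of `page` enclosed by `left` and `right`."""
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


def run_wrapper(page, rules):
    """Apply one (left, right) delimiter pair per field and zip into records."""
    fields = list(rules)
    columns = [extract_between(page, *rules[field]) for field in fields]
    return [dict(zip(fields, row)) for row in zip(*columns)]


# Two hypothetical sites that present the same kind of data differently.
SITE_A = "<li><b>Widget</b> <i>9.99</i></li><li><b>Gadget</b> <i>19.50</i></li>"
SITE_B = "Item: Widget | Price: 10.49\nItem: Gadget | Price: 18.00\n"

# One wrapper (a set of delimiter rules) per site.
WRAPPERS = {
    "site_a": {"product": ("<b>", "</b>"), "price": ("<i>", "</i>")},
    "site_b": {"product": ("Item: ", " |"), "price": ("Price: ", "\n")},
}

records = []
for site, page in (("site_a", SITE_A), ("site_b", SITE_B)):
    for record in run_wrapper(page, WRAPPERS[site]):
        records.append({"site": site, **record})

for record in records:
    print(record)
```

Because every wrapper emits the same `product` and `price` fields, records from both sites can be compared directly, for example to find the cheaper offer for the same product.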

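Building a wrapper is usually laborious and requires specialized knowledge. Because web pages change constantly, maintaining wrappers is also costly. How to construct wrappers automatically has therefore become the central problem. Commonly used approaches include machine learning methods based on inductive learning.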
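A deliberately simplified sketch of inducing a wrapper from labeled examples follows. It is not the algorithm of any system named in this survey. It assumes a single field, a single training page, and values that appear in order, and it takes the longest text shared before every labeled value as the left delimiter and the longest text shared after as the right delimiter. All names and sample pages are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce_delimiters(page, values):
    """Learn a (left, right) delimiter pair from labeled values on one page."""
    before, after, pos = [], [], 0
    for value in values:
        start = page.index(value, pos)  # values are assumed to appear in order
        end = start + len(value)
        before.append(page[:start])
        after.append(page[end:])
        pos = end
    return common_suffix(before), os.path.commonprefix(after)


def apply_rule(page, left, right):
    """Extract every value enclosed by the learned delimiters."""
    if not left or not right:
        raise ValueError("induction produced an empty delimiter")
    out, pos = [], 0
    while (start := page.find(left, pos)) != -1:
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        out.append(page[start:end])
        pos = end + len(right)
    return out


# Hypothetical training page with the two prices labeled by hand.
train = ("<p><b>Widget</b> costs <i>9.99</i></p>"
         "<p><b>Lamp</b> costs <i>19.50</i></p>")
left, right = induce_delimiters(train, ["9.99", "19.50"])

# An unseen page from the same (hypothetical) site.
unseen = ("<p><b>Chair</b> costs <i>45.00</i></p>"
          "<p><b>Desk</b> costs <i>120.00</i></p>")
print(repr(left), repr(right))
print(apply_rule(unseen, left, right))
```

With few examples the learned delimiters can overfit. If both labeled products happened to end in the same letters, those letters would become part of the left delimiter, which is one reason more labeled examples generally help.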

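A number of research systems have been developed. They use machine learning algorithms to generate extraction rules for online information sources. The wrappers generated by ShopBot, WIEN, SoftMealy, and STALKER are delimiter-based and can handle highly structured websites. RAPIER, WHISK, and SRV can handle less structured sources; their extraction methods descend directly from traditional IE methods, and their learning algorithms are mostly relational learning methods.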

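Web information extraction and wrapper generation can be useful in a range of application areas. So far, only commercial applications in comparison shopping have been reasonably successful, and the most prominent systems include Jango, Junglee, and MySimon.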

Section 6.2: Discussion

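Current search engines cannot collect the information held in online databases. Given a user's query, a search engine can find the relevant web pages, but it cannot extract the information on them. The "hidden Web" keeps growing, so tools are needed that extract the relevant information from web pages and gather it together.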

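Because integrating online information is becoming ever more important, research on web information extraction, although fairly new, will keep developing. Machine learning will remain the mainstream approach, because handling dynamic, massive amounts of information requires highly automated techniques. Reference [52] suggests that combining different types of methods to build highly adaptable systems is a promising direction. Reference [36] also proposes an approach that mixes linguistic knowledge with syntactic features.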

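Most of the systems described in this survey target HTML documents. Over the next few years XML will come into widespread use. HTML describes how a document is presented; it is a formatting language. XML, by contrast, can tell you what a document means: it defines content, not just form. This makes wrapper generation simpler, but it does not eliminate the need for wrappers.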
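The contrast can be sketched with two made-up snippets. For the HTML fragment, the wrapper has to know that the price happens to be rendered in a red font. For the XML fragment, the tags themselves name the content, and the remaining wrapper work is mapping one source's tag names onto a shared schema. The tag names, `FIELD_MAP`, and both documents are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML: the markup only says how the price looks.
HTML_PAGE = ("<tr><td><b>Widget</b></td>"
             "<td><font color='red'>9.99</font></td></tr>")

# Hypothetical XML: the markup says what each value is.
XML_DOC = ("<catalog><item><name>Widget</name>"
           "<cost currency='USD'>9.99</cost></item></catalog>")


def price_from_html(page):
    """Delimiter-based extraction tied to presentational markup."""
    left, right = "<font color='red'>", "</font>"
    start = page.index(left) + len(left)
    return page[start:page.index(right, start)]


# This source calls the fields "name" and "cost"; another source might not,
# so a thin wrapper still maps source tags onto the shared field names.
FIELD_MAP = {"name": "product", "cost": "price"}


def records_from_xml(doc):
    """Read items by tag name and rename fields to the shared schema."""
    root = ET.fromstring(doc)
    return [
        {FIELD_MAP[child.tag]: child.text
         for child in item if child.tag in FIELD_MAP}
        for item in root.iter("item")
    ]


print(price_from_html(HTML_PAGE))
print(records_from_xml(XML_DOC))
```

If the HTML site changed the font color, `price_from_html` would break, while the XML version would keep working as long as the tag names stayed the same.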

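The challenge ahead is to build flexible and scalable systems for automatic wrapper induction that can keep up with an ever-growing, dynamic Web.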

References

[1] S. Abiteboul.

Querying Semistructured Data.

Proceedings of the International Conference on Database Theory (ICDT), Greece, January 1997.

[2] B. Adelberg.

NoDoSE - A Tool for Semi-Automatically Extracting Semistructured Data from Text Documents.

Proceedings ACM SIGMOD International Conference on Management of Data, Seattle, June 1998.

[3] D. E. Appelt, D. J. Israel.

Introduction to Information Extraction Technology.

Tutorial for IJCAI-99, Stockholm, August 1999.

[4] N. Ashish, C. A. Knoblock.

Semi-automatic Wrapper Generation for Internet Information Sources.

Second IFCIS Conference on Cooperative Information Systems (CoopIS), South Carolina, June 1997.

[5] N. Ashish, C. A. Knoblock.

Wrapper Generation for Semistructured Internet Sources.

SIGMOD Record, Vol. 26, No. 4, pp. 8--15, December 1997.

[6] P. Atzeni, G. Mecca.

Cut & Paste.

Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'97), Tucson, Arizona, May 1997.

[7] M. Bauer, D. Dengler.

TrIAs - An Architecture for Trainable Information Assistants.

Workshop on AI and Information Integration, in conjunction with the 15th National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, July 1998.

[8] P. Berka.

Intelligent Systems on the Internet.

http://lisp.vse.cz/ berka/ai-inet.htm, Laboratory of Intelligent Systems, University of Economics, Prague.

[9] L. Bright, J. R. Gruser, L. Raschid, M. E. Vidal.

A Wrapper Generation Toolkit to Specify and Construct Wrappers for Web Accessible Data Sources (WebSources).

Computer Systems Special Issue on Semantics on the WWW, Vol. 14, No. 2, March 1999.

[10] S. Brin.

Extracting Patterns and Relations from the World Wide Web.

International Workshop on the Web and Databases (WebDB'98), Spain, March 1998.

[11] M. E. Califf, R. J. Mooney.

Relational Learning of Pattern-Match Rules for Information Extraction.

Proceedings of the ACL Workshop on Natural Language Learning, Spain, July 1997.

[12] M. E. Califf.

Relational Learning Techniques for Natural Language Information Extraction.

Ph.D. thesis, Department of Computer Sciences, University of Texas, Austin, August 1998. Technical Report AI98-276.

[13] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, J. Widom.

The TSIMMIS Project: Integration of Heterogeneous Information Sources.

In Proceedings of IPSJ Conference, pp. 7--18, Tokyo, Japan, October 1994.

[14] B. Chidlovskii, U. M. Borghoff, P-Y. Chevalier.

Towards Sophisticated Wrapping of Web-based Information Repositories.

Proceedings of the 5th International RIAO Conference, Montreal, Quebec, June 1997.

[15] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery.

Learning to Extract Symbolic Knowledge from the World Wide Web.

Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, July 1998.

[16] M. Craven, S. Slattery, K. Nigam.

First-Order Learning for Web Mining.

Proceedings of the 10th European Conference on Machine Learning, Germany, April 1998.

[17] R. B. Doorenbos, O. Etzioni, D. S. Weld.

A Scalable Comparison-Shopping Agent for the World Wide Web.

Technical report UW-CSE-96-01-03, University of Washington, 1996.

[18] R. B. Doorenbos, O. Etzioni, D. S. Weld.

A Scalable Comparison-Shopping Agent for the World-Wide-Web.

Proceedings of the First International Conference on Autonomous Agents, California, February 1997.

[19] O. Etzioni.

Moving up the Information Food Chain: Deploying Softbots on the World Wide Web.

AI Magazine, 18(2):11-18, 1997.

[20] D. Florescu, A. Levy, A. Mendelzon.

Database Techniques for the World Wide Web: A Survey.

ACM SIGMOD Record, Vol. 27, No. 3, September 1998.

[21] D. Freitag.

Information Extraction from HTML: Application of a General Machine Learning Approach.

Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, July 1998.

[22] D. Freitag.

Machine Learning for Information Extraction in Informal Domains.

Ph.D. dissertation, Carnegie Mellon University, November 1998.

[23] D. Freitag.

Multistrategy Learning for Information Extraction.

Proceedings of the 15th International Conference on Machine Learning (ICML-98), Madison, Wisconsin, July 1998.

[24] R. Gaizauskas, Y. Wilks.

Information Extraction: Beyond Document Retrieval.

Computational Linguistics and Chinese Language Processing, Vol. 3, No. 2, pp. 17--60, August 1998.

[25] H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, J. Widom.

Integrating and Accessing Heterogeneous Information Sources in TSIMMIS.

In Proceedings of the AAAI Symposium on Information Gathering, pp. 61--64, Stanford, California, March 1995.

[26] S. Grumbach, G. Mecca.

In Search of the Lost Schema.

Proceedings of the International Conference on Database Theory (ICDT'99), Jerusalem, January 1999.

[27] J-R. Gruser, L. Raschid, M. E. Vidal, L. Bright.

Wrapper Generation for Web Accessible Data Source.

Proceedings of the 3rd IFCIS International Conference on Cooperative Information Systems (CoopIS-98), New York, August 1998.

[28] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo.

Extracting Semistructured Information from Web.

Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.

[29] J. Hammer, H. Garcia-Molina, S. Nestorov, R. Yerneni, M. Breunig, V. Vassalos.

Template-Based Wrappers in the TSIMMIS System.

Proceedings of the 26th SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997.

[30] C-H. Hsu.

Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules.

Workshop on AI and Information Integration, in conjunction with the 15th National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, July 1998.

[31] C-H. Hsu, M-T. Dung.

Generating Finite-State Transducers for Semistructured Data Extraction from the Web.

Information Systems, Vol. 23, No. 8, pp. 521--538, 1998.

[32] C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, P. J. Modi, I. Muslea, A. G. Philpot, S. Tejada.

Modeling Web Sources for Information Integration.

Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, July 1998.

[33] N. Kushmerick, D. S. Weld, R. Doorenbos.

Wrapper Induction for Information Extraction.

15th International Joint Conference on Artificial Intelligence (IJCAI-97), Nagoya, August 1997.

[34] N. Kushmerick.

Wrapper Induction for Information Extraction.

Ph.D. dissertation, University of Washington. Technical Report UW-CSE-97-11-04, 1997.

[35] N. Kushmerick.

Wrapper Induction: Efficiency and Expressiveness.

Workshop on AI and Information Integration, in conjunction with the 15th National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, July 1998.

[36] N. Kushmerick.

Gleaning the Web.

IEEE Intelligent Systems, 14(2), March/April 1999.

[37] S. Lawrence, C. L. Giles.

Searching the World Wide Web.

Science, Vol. 280, pp. 98--100, April 1998.

[38] A. Y. Levy, A. Rajaraman, J. J. Ordille.

Querying Heterogeneous Information Sources Using Source Descriptions.

Proceedings of the 22nd VLDB Conference, Bombay, September 1996.

[39] S. Muggleton, C. Feng.

Efficient Induction of Logic Programs.

Proceedings of the First Conference on Algorithmic Learning Theory, New York, 1990.

[40] I. Muslea.

Extraction Patterns: From Information Extraction to Wrapper Induction.

Information Sciences Institute, University of Southern California, 1998.

[41] I. Muslea.

Extraction Patterns for Information Extraction Tasks: A Survey.

Workshop on Machine Learning for Information Extraction, Orlando, July 1999.

[42] I. Muslea, S. Minton, C. Knoblock.

STALKER: Learning Extraction Rules for Semistructured, Web-based Information Sources.

Workshop on AI and Information Integration, in conjunction with the 15th National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, July 1998.

[43] I. Muslea, S. Minton, C. Knoblock.

Wrapper Induction for Semistructured Web-based Information Sources.

Proceedings of the Conference on Automatic Learning and Discovery (CONALD-98), Pittsburgh, June 1998.

[44] I. Muslea, S. Minton, C. Knoblock.

A Hierarchical Approach to Wrapper Induction.

Third International Conference on Autonomous Agents (Agents'99), Seattle, May 1999.

[45] S. Nestorov, S. Abiteboul, R. Motwani.

Inferring Structure in Semistructured Data.

Proceedings of the 13th International Conference on Data Engineering (ICDE'97), Birmingham, England, April 1997.

[46] STS Prasad, A. Rajaraman.

Virtual Database Technology, XML, and the Evolution of the Web.

Data Engineering, Vol. 21, No. 2, June 1998.

[47] J. R. Quinlan, R. M. Cameron-Jones.

FOIL: A Midterm Report.

European Conference on Machine Learning, Vienna, Austria, 1993.

[48] A. Rajaraman.

Transforming the Internet into a Database.

Workshop on Reuse of Web Information, in conjunction with WWW7, Brisbane, April 1998.

[49] A. Sahuguet, F. Azavant.

WysiWyg Web Wrapper Factory (W4f).

http://cheops.cis.upenn.edu/ sahuguet/WAPI/wapi.ps.gz, University of Pennsylvania, August 1998.

[50] D. Smith, M. Lopez.

Information Extraction for Semistructured Documents.

Proceedings of the Workshop on Management of Semistructured Data, in conjunction with PODS/SIGMOD, Tucson, Arizona, May 1997.

[51] S. Soderland.

Learning to Extract Text-based Information from the World Wide Web.

Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD), California, August 1997.

[52] S. Soderland.

Learning Information Extraction Rules for Semistructured and Free Text.

Machine Learning, 1999.

[53] K. Zechner.

A Literature Survey on Information Extraction and Text Summarization.

Term paper, Carnegie Mellon University, 1997.

[54] About mySimon.

http://www.mysimon.com/about mysimon/company/backgrounder.anml

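All articles bearing this mark are original work by this blog's owner, Caoer (草儿). If you index, bookmark, or repost them, please credit the source and the original author. Thank you very much.

Posted on 2007-01-01 15:19 by Caoer (草儿). Categories: ajax, Web Data Mining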
