

hanbing

UIMA is an Open, Industrial-Strength Platform for Unstructured Information Analysis and Search

http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.index.html


Industrial Strength Platform

UIMA stands for the Unstructured Information Management Architecture.

It is an open, industrial-strength, scalable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components.

Although UIMA originated at IBM, it has now moved on to be an Open Source project which is currently incubating at the Apache Software Foundation: http://incubator.apache.org/uima.

UIMA's goal is to provide a common foundation for industry and academia to collaborate and accelerate the world-wide development of technologies critical for discovering the vital knowledge present in the fastest growing sources of information today.

IBM has empowered its products and services with UIMA, creating a channel for third-party vendors to deploy their text and multi-modal analytics in larger integrated solutions.

The premier product platforms that expose the UIMA interfaces to the customer are IBM OmniFind Enterprise Edition and Analytics Edition. The former features UIMA for building full-text and semantic search indexes, and the latter deploys UIMA for information extraction and text analysis.

A general introduction to its use of UIMA with some sample applications can be found here.

To try out the UIMA software framework, download Apache UIMA from the Apache UIMA Site.

Building a Community

UIMA is now an Open Source project at the Apache Software Foundation (ASF), where it is undergoing incubation, which is required of all new Apache projects. We are building an open, world-wide community of users and developers, and continuing UIMA's evolution using the Open Source development paradigm.

In addition to IBM, many universities and industrial organizations are using UIMA to develop analysis engines and UIM solutions.

UIMA Innovation Awards

IBM has established an award program for university faculty called the UIMA Innovation Awards, and has given the awards for the past several years to encourage building the community around UIMA. More information on this program can be found here.

UIMA is being Standardized at OASIS

UIMA is undergoing a standardization effort at OASIS, via the OASIS Unstructured Information Management Architecture (UIMA) TC, which is working on standardizing semantic search and content analytics. This committee, open to all interested parties who are members of OASIS, has been meeting regularly, and plans to publish a first draft of its work in early 2008.

Third Party Vendors and UIMA

A host of third-party vendors have announced their use of UIMA to wrap and deploy their analysis capabilities and to build UIMA-based solutions. For details see the August 2005 OmniFind Press Release.

You can find more UIMA annotators by searching the internet for the phrase "uima annotators". Carnegie Mellon University and Jena University both have repositories of UIMA components.

 

To learn more about UIMA as an open platform for building unstructured information and knowledge management applications, read on.




Attached file: IndustrialStrength.png

The Knowledge Rush

Unstructured Information – The Knowledge Rush

Knowledge

Unstructured information represents the largest, most current and fastest growing source of knowledge available to businesses and governments world-wide.

The web is just the tip of the iceberg. Consider the droves of corporate documentation ranging from best-practices, technical reports, problem reports, customer communications and contracts to emails and voice mails. In these mounds of natural language artifacts often lie the nuggets of knowledge critical for realizing important trends, creating new opportunities, solving problems or preventing disasters.

  • Shaving off just seconds per call to find the right technical documentation in call-centers can save millions.

  • Rapidly detecting emerging trends in problem-reports coming in from all over the globe can avoid recalls and save companies and their customers millions if not billions.

  • Detecting otherwise unrealized drug interactions by analyzing the linkages in medical abstracts can help prevent disaster as well as help discover new drugs or cures.

  • Analyzing communications linked to terrorist networks in the form of multi-lingual text or other modalities can help uncover plots threatening national security before they happen.

  • Analyzing SEC reports to help evaluate corporate financial positions

  • And many more applications…

Applications like these, which rely on the rapid discovery of vital knowledge, require the analysis of unstructured information. This is all the information that has NOT been carefully encoded in enterprise databases but rather exists as natural language text, speech or video.

Unstructured information includes the documents found on the web, plus an estimated 80% of the information generated by enterprises around the world. The principal challenge with unstructured information is that it needs to be analyzed in order to identify, locate and relate the entities and relationships of interest – discover the vital knowledge contained therein.

Once these entities and relationships are detected, they may be indexed in structured forms so that powerful search technologies like search engines and database engines can efficiently find the knowledge you need, when you need it.

The bridge from the unstructured world to the structured is enabled by the software agents that do the analysis. These can scan a text document, for example, and pull out chemical names and their interactions, or identify events, locations, products, opinions about products, problems, methods etc. UIMA calls these software agents analysis engines.

The Essential Analysis — A Best of Breed Integration

There are all kinds of analysis engines being developed in industry and academia. Each tends to be highly specialized in solving small and different parts of an overall solution. Some engines, for example, specialize in breaking up documents into individual words (simple perhaps for white-space delimited languages but Chinese, for example, is another story). This is just the first step in the process.
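The difference between white-space delimited languages and Chinese can be illustrated with a minimal sketch. The function name `whitespace_tokens` and the sample strings are assumptions made for this illustration; the code uses only the Python standard library and is not part of UIMA.

```python
import re
from typing import List, Tuple


def whitespace_tokens(text: str) -> List[Tuple[int, int, str]]:
    """Return (begin, end, token) triples for each run of non-whitespace characters."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


# English words are separated by spaces, so each word becomes its own token.
english = whitespace_tokens("UIMA analyzes unstructured text")

# Chinese text carries no spaces between words, so the whole phrase comes
# back as one token; a specialized segmentation engine would be needed here.
chinese = whitespace_tokens("非结构化信息管理")

print(english)
print(chinese)
```

Keeping the character offsets alongside each token matters because later engines refer to regions of the original document rather than to copies of its text.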

To accurately detect and classify domain-specific knowledge, deeper analysis is required. This may depend on part-of-speech detection, grammatical parsing and named-entity recognition where proper names, organizations and locations are identified. Other engines may specialize in detecting events and times and then others work on detecting relationships between these elements. A variety of techniques may be used to develop these specialized engines including rule-based and statistical machine learning algorithms.

Analysis engines may vary along a variety of dimensions, including document modality (text, speech, video), format, natural language, style and domain. They may also make different performance tradeoffs, favoring, for example, precision over speed or recall over precision.

The critical point is that to develop a complete solution that takes you from unstructured information to usable knowledge you must integrate a variety of independently developed analysis engines. These must be integrated to perform a comprehensive analysis task and then their results must be funneled into systems that allow users to rapidly find and exploit the discovered knowledge, for example, search engines, databases and/or knowledge bases.

IBM’s Mission in Unstructured Information Management (UIM)

An Unstructured Information Management (UIM) solution may be generally characterized as a software system that analyzes large volumes of unstructured information (text, audio, video, images, etc.) to discover, organize and deliver relevant knowledge to the client or application end-user. An example is an application that processes millions of medical abstracts to discover critical drug interactions. Another example is an application that processes tens of millions of documents to discover key evidence indicating probable competitive threats.

Acknowledging the tremendous value in unstructured information sources, IBM products and services centered on information integration are powered by UIMA and positioned to leverage increasingly sophisticated analytics to deliver greater and greater value to our customers.

Analysis engines and related resources for building UIM solutions will come from a wide variety of vendors. IBM wants to ensure that our products and services can exploit best-of-breed combinations of these technologies to deliver the best end-to-end solutions to our customers.

We want to encourage Apache UIMA’s broad adoption to cultivate a world-wide community focused on the development, refinement and integration of advanced analysis technologies that will help enhance our solutions and drive this industry forward.

Download Apache UIMA

 

UIMA Architecture Highlights

An Open Architecture and Framework

The Unstructured Information Management Architecture (UIMA) is an architecture and framework that helps you build the bridge from unstructured information to structured knowledge. It is an industrial-strength, scalable integrating platform for composing analysis engines and integrating their results in back-end information processing systems.

Multi-Node deployment platform

UIMA targets large-scale solution development. It lets the right skills focus on the right parts of a solution and enables rapid integration across technologies and platforms. Deployment options range from tightly coupled to fully distributed, so you can maximize single-CPU performance, flexibility and/or scale-out.

An overview of the UIMA architecture is covered in the IBM Systems Journal. For more detailed information about the architecture and the UIMA SDK, please see the documentation for Apache UIMA, especially the Conceptual Overview (chapter 2).

UIMA is the engineering foundation upon which many academic projects and commercial enterprises, including IBM, are developing and combining the work of many researchers and engineers, both to accelerate scientific advances and to deliver analysis into a variety of search and knowledge management applications.

In its commitment to make UIMA an open platform, IBM has donated the source code for UIMA to the Apache Software Foundation, and has moved development for UIMA into Open Source development, following the Apache way of doing things. Apache UIMA is currently (2007) in the Apache Incubator (http://incubator.apache.org/uima), which is required of all new projects coming into Apache, and has made several releases. You can obtain the releases, as well as all the source code for UIMA at Apache, and even join the community that is continuing its ongoing development.

Architecture Highlights


UIMA Architecture Diagram

Common Data Representation

At the heart of UIMA is a common representation system called the CAS or Common Analysis Structure.

The CAS is used to provide analysis engines with read access to the artifact being analyzed (e.g., document, image, video, etc.) and read/write access to the analysis results or annotations associated with defined regions of the artifact. Regions may correspond to words, sentences or paragraphs in text, or to frames or parts of frames in video, for example.

The CAS is shared among analysis engines working in concert as part of a larger workflow to process a collection of artifacts; it is passed from one analysis engine to the next in a flow.

UIMA supports standard XML and high-speed binary serializations of the CAS. The CAS may be passed among Java and C++ analysis engines.

UIMA provides a native Java Interface to the CAS that renders analysis results as Java objects and properties making it easy for the Java programmer to interact with the CAS.

The CAS contains indexes that enable high performance access to type instances.
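The ideas in this section (a read-only artifact, writable annotations over regions of it, and an index by type) can be pictured with a small sketch. This is not the UIMA CAS API: the names `SimpleCas`, `Annotation`, `add`, `select` and `covered_text`, as well as the type names "Person" and "Organization", are assumptions made purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """An analysis result attached to the region [begin, end) of the artifact."""
    type_name: str                      # illustrative type name, e.g. "Organization"
    begin: int
    end: int
    features: Dict[str, str] = field(default_factory=dict)


class SimpleCas:
    """Toy stand-in for a CAS: read-only artifact plus annotations indexed by type."""

    def __init__(self, artifact: str) -> None:
        self._artifact = artifact
        self._by_type: Dict[str, List[Annotation]] = defaultdict(list)

    @property
    def artifact(self) -> str:
        # Exposed without a setter, so engines can read but not replace it.
        return self._artifact

    def add(self, annotation: Annotation) -> None:
        if not (0 <= annotation.begin <= annotation.end <= len(self._artifact)):
            raise ValueError("annotation lies outside the artifact")
        self._by_type[annotation.type_name].append(annotation)

    def select(self, type_name: str) -> List[Annotation]:
        """Return all annotations of one type, ordered by position in the artifact."""
        return sorted(self._by_type.get(type_name, []), key=lambda a: (a.begin, a.end))

    def covered_text(self, annotation: Annotation) -> str:
        return self._artifact[annotation.begin:annotation.end]


cas = SimpleCas("Fred Center is the CEO of Center Micros.")
cas.add(Annotation("Organization", 26, 39))
cas.add(Annotation("Person", 0, 11))

for person in cas.select("Person"):
    print(cas.covered_text(person))
```

Storing offsets instead of copied strings keeps every result tied to a defined region of the artifact, and the per-type dictionary plays the role of the index that gives fast access to type instances.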

Plug-n-Play Analysis Engines

Analysis Engines process CASes. They look at the subject of analysis and any results produced by previous analysis engines, and they discover and add more metadata to the CAS.

The logical interface for an analysis engine is simple: CAS in, CAS out. This simplicity facilitates interoperability and composability of independently developed engines.

Analysis engines may be organized and composed together to form reusable components that encapsulate rich workflows of cooperating engines. UIMA tooling supports this composition.
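The "CAS in, CAS out" contract and the composition of engines can be sketched in a few lines. Everything here is a hypothetical simplification rather than UIMA's interfaces: the CAS is reduced to a plain dictionary, and the names `tokenizer`, `capitalized_word_marker` and `aggregate` are invented for the example.

```python
import re
from typing import Any, Callable, Dict

# Assumption: a CAS reduced to a dict holding the artifact text and a list of
# (type, begin, end) annotations. A real CAS is typed and indexed.
Cas = Dict[str, Any]
Engine = Callable[[Cas], Cas]


def new_cas(text: str) -> Cas:
    return {"text": text, "annotations": []}


def tokenizer(cas: Cas) -> Cas:
    """Primitive engine: adds one Token annotation per word."""
    for match in re.finditer(r"\w+", cas["text"]):
        cas["annotations"].append(("Token", match.start(), match.end()))
    return cas


def capitalized_word_marker(cas: Cas) -> Cas:
    """Primitive engine that builds on the Token results of an earlier engine."""
    tokens = [a for a in cas["annotations"] if a[0] == "Token"]
    for _, begin, end in tokens:
        if cas["text"][begin].isupper():
            cas["annotations"].append(("Capitalized", begin, end))
    return cas


def aggregate(*engines: Engine) -> Engine:
    """Compose engines into one engine with the same CAS in, CAS out interface."""
    def run(cas: Cas) -> Cas:
        for engine in engines:
            cas = engine(cas)
        return cas
    return run


pipeline = aggregate(tokenizer, capitalized_word_marker)
result = pipeline(new_cas("Fred Center joined IBM"))
capitalized = [result["text"][b:e] for t, b, e in result["annotations"] if t == "Capitalized"]
print(capitalized)
```

Because the aggregate has the same signature as its parts, it can itself be nested inside a larger aggregate, which is the property that makes independently developed engines reusable.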

Analysis Engines can be deployed by the framework to cooperate in a single process, in different processes on the same machine, or across machines using a variety of protocols including SOAP, for example.

To find out more about these and other UIMA components, see the IBM Systems Journal special issue on Unstructured Information and, for a more detailed treatment, the UIMA Documentation.

Multiple Views and Multi-Modal Support

The CAS can contain multiple views of the same logical artifact. For example, a document may be translated into different languages. Each may represent a different view of the same logical content but may be analyzed independently. A single CAS can represent all views, providing isolated or integrated access to these multiple views.

Each view is called a "Sofa" for Subject of Analysis since it can become an independent subject of different analysis engines.

Sofas come in very handy for analyzing multiple modalities, for example, the video, audio and closed captions of a video stream. Sofas can be generated on the fly. This feature supports segmentation of streaming data anywhere in the analysis pipeline.
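A rough picture of one CAS holding several views, each with its own subject of analysis and its own annotations, is sketched below. The class and method names (`MultiViewCas`, `View`, `create_view`) and the view names are assumptions for illustration and do not reproduce the UIMA API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class View:
    """One view: a subject of analysis (sofa) and the annotations made over it."""
    sofa: str
    annotations: List[Tuple[str, int, int]] = field(default_factory=list)


class MultiViewCas:
    """Toy CAS that keeps several named views of the same logical artifact."""

    def __init__(self) -> None:
        self._views: Dict[str, View] = {}

    def create_view(self, name: str, sofa: str) -> View:
        # Views can be added at any time, e.g. by a translation step mid-pipeline.
        if name in self._views:
            raise ValueError(f"view {name!r} already exists")
        view = View(sofa)
        self._views[name] = view
        return view

    def view(self, name: str) -> View:
        return self._views[name]

    def view_names(self) -> List[str]:
        return list(self._views)


cas = MultiViewCas()
english = cas.create_view("english", "The center of the city")
german = cas.create_view("german", "Das Zentrum der Stadt")

# Offsets are relative to each view's own sofa, so the views stay independent.
english.annotations.append(("Token", 4, 10))
german.annotations.append(("Token", 4, 11))

print(english.sofa[4:10], german.sofa[4:11])
```

An engine that only understands one language or modality can be handed a single view, while an engine that aligns translations can ask the same CAS for both.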

UIMA has been used as a platform for IBM’s video analysis and search system MARVEL and for a project that acquires, converts, translates, and indexes video news channels, called Tales.

Java and C++ Interoperability

Because the existing community of builders of analytic components used both C++ and Java, UIMA supports the development and deployment of analytics written in these languages (and also in some others, like Perl and Python) and supports their interoperability in both collocated and distributed deployment models through several different high-speed mechanisms.

Component Packaging, Discovery and Reuse

Analysis engines may have been developed with a host of technical dependencies. Key to component reuse is that engines can be packaged up from the environment in which they were developed and tested, and then deployed in a different environment. UIMA includes utilities for packaging an analysis engine and all its dependent resources and installing it in a different run-time environment.

Additionally components are associated with a variety of meta-data that facilitate their discovery by solution integrators targeting specific analysis requirements.

Collection Processing and Scalability

UIM applications typically don’t stop after a single document; rather, they tend to process large collections of documents. What is ultimately of interest is typically the aggregate of analysis results collected over an entire collection of unstructured information sources.

Applications want to avoid going down (crashing) in the middle of processing millions of documents just because a single document was strangely formatted.

Additionally, applications want to scale-out to better utilize hardware resources especially given that document processing is often easily parallelized across many analysis pipelines.

UIMA addresses these needs and provides for robust failure recovery, logging, and multi-pipelining to support building scalable unstructured information analysis and search applications.
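The failure-recovery idea, where one strangely formatted document must not bring down a run over a whole collection, can be sketched as follows. The functions `process_collection` and `count_words` and the sample documents are hypothetical; the sketch shows per-document error isolation, logging and aggregation only, and leaves out multi-pipelining.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("collection")


def process_collection(
    documents: Iterable[Tuple[str, str]],
    analyze: Callable[[str], Dict[str, int]],
) -> Tuple[Dict[str, int], List[str]]:
    """Analyze every document, skipping (and logging) the ones that fail."""
    totals: Dict[str, int] = {}
    failed: List[str] = []
    for doc_id, text in documents:
        try:
            counts = analyze(text)
        except Exception as error:
            # One bad document is recorded and skipped; the run continues.
            logger.warning("skipping %s: %s", doc_id, error)
            failed.append(doc_id)
            continue
        # Merge this document's results into the collection-level aggregate.
        for key, value in counts.items():
            totals[key] = totals.get(key, 0) + value
    return totals, failed


def count_words(text: str) -> Dict[str, int]:
    """Toy analysis that rejects documents containing NUL bytes."""
    if "\x00" in text:
        raise ValueError("binary content")
    counts: Dict[str, int] = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


docs = [("a", "UIMA scales"), ("b", "bad\x00doc"), ("c", "UIMA recovers")]
totals, failed = process_collection(docs, count_words)
print(totals, failed)
```

Since each document is analyzed independently, the loop body is also the natural unit to hand to parallel workers when scaling out.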

UIMA and Semantic Search

Click here to view a powerpoint presentation on Semantic Search and UIMA.

Did you ever type in a keyword search query, get hundreds of thousands of documents back, and wonder how you could ever sift through all those hits, hit list after hit list?

A powerful thing you can do with the results of UIMA analysis is to enable more effective search systems – systems that can more precisely target your intended interest.

Semantic Search is a class of document retrieval that allows the user to exploit the results of UIMA analysis to create much more effective queries – queries that can home in on exactly what you are looking for.

OmniFind search and, on a smaller scale, the semantic search engine available from IBM's alphaWorks (called SemanticSearch) can exploit the additional information from the UIMA CAS to implement more powerful and precise queries.

For example, imagine a user is looking for documents that mention an organization with “center” in its name, but is not sure of the full or precise name of the organization.

A key-word search on “center” would likely produce way too many documents because “center” is a common and ambiguous term. Our semantic search engine supports a query language called XML Fragments. This query language is designed to exploit UIMA’s CAS annotations entered in the search engine’s index. The XML Fragment query, for example,

<organization> center </organization>

will return, at the top of the hit list, only documents in which “center” appears as part of a phrase annotated as an organization by a named-entity recognizer. This hit list will be much shorter and will match the user’s interest more precisely.
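The effect of such a query, compared with a plain keyword search, can be imitated with a short sketch. It does not parse the XML Fragments language or use any OmniFind API: the function `matches_fragment`, the two-document corpus and the hand-written annotation offsets are assumptions, and unlike a ranking search engine the sketch filters strictly.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[str, int, int]  # (annotation type, begin offset, end offset)


def matches_fragment(text: str, annotations: List[Span], ann_type: str, keyword: str) -> bool:
    """True if the keyword occurs as a word inside a span annotated with ann_type."""
    keyword = keyword.lower()
    for type_name, begin, end in annotations:
        if type_name == ann_type:
            words = re.findall(r"\w+", text[begin:end].lower())
            if keyword in words:
                return True
    return False


# Hypothetical corpus: each document comes with the annotations that a
# named-entity recognizer is assumed to have produced for it.
corpus: Dict[str, Tuple[str, List[Span]]] = {
    "doc1": ("Fred works at Center Micros.", [("organization", 14, 27)]),
    "doc2": ("Draw the center of the circle.", []),
}

keyword_hits = [doc_id for doc_id, (text, _) in corpus.items() if "center" in text.lower()]
semantic_hits = [
    doc_id
    for doc_id, (text, spans) in corpus.items()
    if matches_fragment(text, spans, "organization", "center")
]

print(keyword_hits)
print(semantic_hits)
```

The keyword search returns both documents, while the annotation-aware match keeps only the one where the word is part of an organization name.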

Consider taking this a step further. We can add a relationship recognizer to the UIMA pipeline that annotates mentions of the “CEO of” relationship. We can then configure the CAS Consumer so that it sends these new relationship annotations to the semantic search index as well. With these additional analysis results in the index we can submit queries like

<ceo_of>

<person>center </person>

<organization>center </organization>

</ceo_of>

“Center” is a common word with over 13 different meanings, but this query will zoom in on those documents that contain the word used as the name of a person or in the name of an organization.

Furthermore, it will favor those documents where “Center”, the person, is the “CEO of” an organization that shares the name. The semantic search engine would include as top hits documents with

“…Fred Center, CEO of Center Micros…” or

“…The CEO of Center Systems, Mr. Center…”

Phrases like “...the center of the circle...” or “...Mr. Center threw the ball to the center of the team…” would not match. [1]

This kind of precision is the power that UIMA plus semantic search can bring to your applications.

[1] The query exactly as shown would include less precise matches but rank them lower. The query can be further specialized to exclude anything but exact matches as suggested here.

Availability

UIMA is available from the Apache Software Foundation's incubator website, http://incubator.apache.org/uima.

Both ready-to-use binary distributions and all of the source code for UIMA are available.

Apache UIMA maintains two publicly archived mailing lists, one for the ongoing development of UIMA, and the other for users of the framework. A wiki for Apache UIMA is also available.

The UIMA framework has been embedded in a variety of supported IBM products enabling them to leverage advanced analytics.

In particular the UIMA APIs are available for creating customized solutions in IBM OmniFind Enterprise Edition.

UIMA Architecture and Framework

The UIMA architecture and software framework is continually being advanced at IBM Research in light of requirements coming from many different areas including:

  • text and multi-modal analytics
  • machine translation systems
  • transcription systems
  • bioinformatics
  • high throughput analysis systems
  • knowledge integration
  • semantic search
  • program analysis
  • social network analysis
  • question answering
  • call-centers
  • change detection
  • security and
  • semantic web applications.

Research Contact(s): Dr. David Ferrucci (ferrucci@us.ibm.com)

Related Papers:

D. Ferrucci and A. Lally. "UIMA: an architectural approach to unstructured information processing in the corporate research environment," Natural Language Engineering 10, No. 3-4, 327-348 (2004).

D. Ferrucci and A. Lally, "Building an example application with the Unstructured Information Management Architecture," IBM Systems Journal 43, No. 3, 455-475 (2004).

T. Goetz and O. Suhre, "Design and implementation of the UIMA Common Analysis System," IBM Systems Journal 43, No. 3, 490-515 (2004).

Anthony Levas, Eric Brown, J. William Murdock, and David Ferrucci. "The Semantic Analysis Workbench (SAW): Towards a Framework for Knowledge Gathering and Synthesis." Proceedings of the International Conference on Intelligence Analysis. McLean, VA, May 2-6, 2005.

See other UIMA Related Projects for links to more papers.


IBM Technology to Automate Customer Satisfaction Analysis

IBM Technology to Automate Customer Satisfaction Analysis is a business intelligence technology for customer satisfaction analysis targeted at customer-centric enterprises such as contact centers. This project analyses agent-customer interaction data and extracts information about the reasons why customers are satisfied or dissatisfied with the service. After training and tuning based on archived interactions, it can automatically analyze new cases and extract business intelligence. This tool automates manual processes, thereby reducing effort as well as analysis time.

The heart of the system, built using the UIMA framework, is based on machine learning and natural language processing.

Statistical and rule-based models needed by this technology are custom created by training the system on historical data.

This technology has been picked up by IBM Daksh, an IBM subsidiary company that provides Business Transformation Outsourcing solutions.


posted on 2008-09-17 16:27 by 睡得惊动了党
