An Open Architecture and Framework
The Unstructured Information Management Architecture (UIMA) is an architecture and framework that helps you build the bridge from unstructured information to structured knowledge. It is an industrial-strength, scalable integration platform for composing analysis engines and integrating their results into back-end information processing systems.
Targeting large-scale solution development, UIMA allows the right skills to focus on the right parts of solution development. It enables rapid integration across technologies and platforms in a range of deployment options, from tightly coupled to fully distributed, allowing you to maximize single-CPU performance, flexibility, and/or scale-out.
An overview of the UIMA architecture is given in the IBM Systems Journal. For more detailed information about the architecture and the UIMA SDK, please see the documentation for Apache UIMA, especially the Conceptual Overview (chapter 2).
UIMA is the engineering foundation upon which many academic projects and commercial enterprises, including IBM, are developing and combining the work of many researchers and engineers, both to accelerate scientific advances and to deliver analysis into a variety of search and knowledge management applications.
In its commitment to making UIMA an open platform, IBM has donated the source code for UIMA to the Apache Software Foundation and has moved UIMA development into open source, following the Apache way of doing things. Apache UIMA is currently (2007) in the Apache Incubator (http://incubator.apache.org/uima), a stage required of all new projects coming into Apache, and has made several releases. You can obtain the releases, as well as all the source code for UIMA, at Apache, and even join the community that is continuing its development.
Architecture Highlights

Common Data Representation
At the heart of UIMA is a common representation system called the CAS or Common Analysis Structure.
The CAS provides analysis engines with read access to the artifact being analyzed (e.g., a document, image, or video) and read/write access to the analysis results, or annotations, associated with defined regions of the artifact. Regions may correspond, for example, to words, sentences, or paragraphs in text, or to frames or parts of frames in video.
The CAS is shared among analysis engines working in concert as part of a larger workflow to process a collection of artifacts; it is passed from one analysis engine to the next in a flow.
UIMA supports standard XML and high-speed binary serializations of the CAS. The CAS may be passed among Java and C++ analysis engines.
UIMA provides a native Java interface to the CAS that renders analysis results as Java objects and properties, making it easy for the Java programmer to interact with the CAS.
The CAS contains indexes that enable high performance access to type instances.
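To make the idea concrete, here is a minimal sketch of a CAS-like structure in plain Java. It is not the Apache UIMA API: the names `ToyCas`, `ToyAnnotation`, `addAnnotation`, `select`, and `coveredText` are hypothetical, chosen only to illustrate read-only access to the artifact, read/write access to annotations over regions of it, and an index keyed by type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical annotation: a typed region [begin, end) of the artifact text. */
final class ToyAnnotation {
    final String type;
    final int begin;
    final int end;

    ToyAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

/** Hypothetical, heavily simplified CAS; not the real UIMA class. */
final class ToyCas {
    // The artifact itself is read-only once the CAS is created.
    private final String documentText;
    // Index from type name to its instances, so lookups by type avoid a full scan.
    private final Map<String, List<ToyAnnotation>> indexByType = new HashMap<>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }

    String getDocumentText() {
        return documentText;
    }

    /** Adds an analysis result covering the region [begin, end). */
    void addAnnotation(String type, int begin, int end) {
        if (begin < 0 || begin > end || end > documentText.length()) {
            throw new IllegalArgumentException("Region outside artifact: " + begin + ".." + end);
        }
        indexByType.computeIfAbsent(type, k -> new ArrayList<>())
                   .add(new ToyAnnotation(type, begin, end));
    }

    /** Returns all annotations of the given type, in insertion order. */
    List<ToyAnnotation> select(String type) {
        List<ToyAnnotation> found = indexByType.getOrDefault(type, Collections.emptyList());
        return Collections.unmodifiableList(found);
    }

    /** Returns the part of the artifact that an annotation covers. */
    String coveredText(ToyAnnotation annotation) {
        return documentText.substring(annotation.begin, annotation.end);
    }
}
```

A tokenizer would call `addAnnotation("Token", begin, end)` for each word, and a later engine would call `select("Token")` to read those results without rescanning the text. The real CAS adds a type system, feature structures, and several index kinds that this sketch leaves out.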
Plug-n-Play Analysis Engines
Analysis engines process CASes. They look at the subject of analysis and any results produced by previous analysis engines, and they discover and add more metadata to the CAS.
The logical interface for an analysis engine is simple: CAS in/CAS out. This simplicity facilitates the interoperability and composability of independently developed engines.
Analysis engines may be organized and composed together to form reusable components that encapsulate rich workflows of cooperating engines. UIMA tooling supports this composition.
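The CAS in/CAS out contract and the composition of engines can be sketched in a few lines of plain Java. This is an illustration under assumptions, not the UIMA API: `MiniCas`, `Engine`, `AggregateEngine`, and `PipelineDemo` are hypothetical names, and the two engines are trivial stand-ins for real analytics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical minimal CAS: the artifact text plus the results added so far. */
final class MiniCas {
    final String text;
    final List<String> results = new ArrayList<>();

    MiniCas(String text) {
        this.text = text;
    }
}

/** Hypothetical engine contract: receive a CAS, add results to it in place. */
interface Engine {
    void process(MiniCas cas);
}

/** An aggregate is itself an Engine, so aggregates can be nested and reused. */
final class AggregateEngine implements Engine {
    private final List<Engine> delegates;

    AggregateEngine(Engine... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    @Override
    public void process(MiniCas cas) {
        // Fixed linear flow: the same CAS is handed to each delegate in order.
        for (Engine delegate : delegates) {
            delegate.process(cas);
        }
    }
}

final class PipelineDemo {
    static MiniCas run(String text) {
        // First engine looks only at the subject of analysis.
        Engine tokenCounter =
            cas -> cas.results.add("tokens=" + cas.text.trim().split("\\s+").length);
        // Second engine reads what earlier engines left in the CAS.
        Engine resultCounter =
            cas -> cas.results.add("seen=" + cas.results.size());

        Engine pipeline = new AggregateEngine(tokenCounter, resultCounter);
        MiniCas cas = new MiniCas(text);
        pipeline.process(cas);
        return cas;
    }
}
```

Because `AggregateEngine` implements the same interface as its delegates, a caller cannot tell a composed workflow from a single engine, which is what lets independently developed engines be combined freely. The sketch uses a fixed order; other flow strategies are left out.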
Analysis engines can be deployed by the framework to cooperate in a single process, in different processes on the same machine, or across machines using a variety of protocols, such as SOAP.
To find out more about these and other UIMA components, see the IBM Systems Journal special issue on Unstructured Information and, for a more detailed treatment, the UIMA documentation.
Multiple Views and Multi-Modal Support
The CAS can contain multiple views of the same logical artifact. For example, a document may be translated into different languages. Each translation may represent a different view of the same logical content, yet each may be analyzed independently. A single CAS can represent all views, providing isolated or integrated access to them.
Each view is called a "Sofa" (for Subject of Analysis), since it can become an independent subject of different analysis engines.
Sofas come in very handy for analyzing multiple modalities, for example the video, audio, and closed captions of a video stream. Sofas can be generated on the fly. This feature supports segmentation of streaming data anywhere in the analysis pipeline.
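As a rough illustration of multiple views, the sketch below keeps several named Sofas in one container and lets a new view be created on the fly from an existing one. It is plain Java, not the UIMA API: `MultiViewCas`, `createView`, `getSofa`, and `viewNames` are hypothetical names, and Sofa data is reduced to a string for brevity.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical container holding several named views (Sofas) of one logical artifact. */
final class MultiViewCas {
    // Insertion-ordered so views are listed in the order they were created.
    private final Map<String, String> sofaByViewName = new LinkedHashMap<>();

    /** Creates a new view; this may happen at any point in a pipeline. */
    void createView(String viewName, String sofaData) {
        Objects.requireNonNull(viewName, "viewName");
        Objects.requireNonNull(sofaData, "sofaData");
        if (sofaByViewName.putIfAbsent(viewName, sofaData) != null) {
            throw new IllegalStateException("View already exists: " + viewName);
        }
    }

    /** Returns the subject of analysis for one view, isolated from the others. */
    String getSofa(String viewName) {
        String sofa = sofaByViewName.get(viewName);
        if (sofa == null) {
            throw new IllegalArgumentException("No such view: " + viewName);
        }
        return sofa;
    }

    /** Lists all views, for engines that need integrated access. */
    Set<String> viewNames() {
        return Collections.unmodifiableSet(sofaByViewName.keySet());
    }
}
```

An engine that derives one modality from another, for example a speech recognizer turning an audio view into a transcript view, would read one Sofa with `getSofa` and register its output with `createView`; downstream engines can then work on either view independently.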
UIMA has been used as a platform for IBM’s video analysis and search system MARVEL and for a project that acquires, converts, translates, and indexes video news channels, called Tales.
Java and C++ Interoperability
Because the existing community of builders of analytic components used both C++ and Java, UIMA supports the development and deployment of analytics written in these languages (and also in some others, such as Perl and Python). It supports their interoperability in both collocated and distributed deployment models through several different high-speed mechanisms.
Component Packaging, Discovery and Reuse
Analysis engines may have been developed with a host of technical dependencies. Key to component reuse is that engines can be packaged up from the environment in which they are developed and tested, and then deployed in a different environment. UIMA includes utilities for packaging an analysis engine and all its dependent resources and installing it in a different run-time environment.
Additionally, components are associated with a variety of metadata that facilitates their discovery by solution integrators targeting specific analysis requirements.
Collection Processing and Scalability
UIM applications typically don’t stop after a single document; rather, they tend to process large collections of documents. Of ultimate interest is typically the aggregate of the analysis results collected over an entire collection of unstructured information sources.
Applications want to avoid going down (crashing) in the middle of processing millions of documents just because a single document was strangely formatted.
Additionally, applications want to scale out to better utilize hardware resources, especially given that document processing is often easily parallelized across many analysis pipelines.
UIMA addresses these needs and provides robust failure recovery, logging, and multi-pipelining to support building scalable unstructured information analysis and search applications.
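The two requirements above, surviving a bad document and running several pipelines in parallel, can be sketched with the standard `java.util.concurrent` classes. This is only an illustration of the idea, not how UIMA implements collection processing: `ToyCollectionProcessor`, `CollectionRun`, and `processAll` are hypothetical names, and the `analyze` function stands in for a full analysis pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical outcome of one collection run: successes plus a failure log. */
final class CollectionRun {
    final List<String> results = new ArrayList<>();
    final List<String> failures = new ArrayList<>();
}

final class ToyCollectionProcessor {
    /**
     * Analyzes every document on a fixed pool of worker threads ("pipelines").
     * A document whose analysis throws is logged and skipped; the run continues.
     */
    static CollectionRun processAll(List<String> documents,
                                    Function<String, String> analyze,
                                    int pipelines) {
        ExecutorService pool = Executors.newFixedThreadPool(pipelines);
        CollectionRun run = new CollectionRun();
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String document : documents) {
                Callable<String> task = () -> analyze.apply(document);
                futures.add(pool.submit(task));
            }
            // Collect in submission order so results line up with the input.
            for (int i = 0; i < futures.size(); i++) {
                try {
                    run.results.add(futures.get(i).get());
                } catch (ExecutionException e) {
                    // Failure is isolated to this one document.
                    run.failures.add("doc " + i + ": " + e.getCause().getMessage());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    run.failures.add("doc " + i + ": interrupted");
                }
            }
        } finally {
            pool.shutdown();
        }
        return run;
    }
}
```

Each document runs as its own task, so an exception surfaces only when that task's `Future` is read and never reaches the other workers. A production system would also need retry policies, timeouts, and checkpointing, which this sketch leaves out.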