Introduction to Clerezza-UIMA integration
UIMA is an OASIS standard that allows the definition of analysis pipelines to manage unstructured information and extract structure and semantics from the given data.
The Clerezza-UIMA integration brings the power of UIMA into Clerezza, allowing existing UIMA components to be reused and new ones to be defined in a linked-data-oriented system.
A basic mechanism for mapping a UIMA CAS (Common Analysis Structure) to an RDF graph has been defined, together with the ability to store the resulting graph in one of the triple stores supported by Clerezza.
Clerezza runs inside an OSGi environment, while UIMA is not fully OSGi compliant as is, so this integration work also takes care of the OSGi adaptation.
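The CAS-to-RDF mapping idea can be pictured with a minimal, dependency-free sketch. It does not use the real UIMA or Clerezza APIs: the Annotation and Triple classes, the toTriples method and the urn:example: predicates are all hypothetical stand-ins chosen for illustration, and the actual mapping in the integration may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the Clerezza or UIMA API. */
public class CasToRdfSketch {

    /** Stand-in for a UIMA annotation: a typed span over the document text. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Stand-in for an RDF triple, kept as three plain strings. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    /** Emits three triples per annotation: link to the document, type, covered text. */
    static List<Triple> toTriples(String docUri, String text, List<Annotation> annotations) {
        List<Triple> triples = new ArrayList<>();
        int index = 0;
        for (Annotation a : annotations) {
            // Assumed naming scheme for annotation nodes (made up for this sketch).
            String node = docUri + "#annotation" + index;
            index++;
            triples.add(new Triple(docUri, "urn:example:hasAnnotation", node));
            triples.add(new Triple(node, "urn:example:type", a.type));
            triples.add(new Triple(node, "urn:example:coveredText",
                    text.substring(a.begin, a.end)));
        }
        return triples;
    }

    public static void main(String[] args) {
        String text = "Clerezza runs on OSGi";
        List<Annotation> annotations = new ArrayList<>();
        annotations.add(new Annotation("Concept", 0, 8));
        for (Triple t : toTriples("urn:example:doc1", text, annotations)) {
            System.out.println(t);
        }
    }
}
```

In this sketch each annotation becomes its own node linked to the document, so the graph can later be extended with further properties without changing the existing triples.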
Clerezza-UIMA modules
- uima.ontologies
an ontology, and the generated Java source code, for defining the UIMA CAS model classes.
- uima.utils
base module which allows the usage of UIMA inside Clerezza.
It defines how UIMA framework classes are instantiated and initialized within the Clerezza OSGi environment, using an extension classloader that collects the classloaders containing the registered UIMA analysis components. To create a UIMA pipeline from a bundle, the bundle must register its UIMA analysis components in the extension classloader; this can be done with a specialized OSGi Activator defined in this module.
The uima.utils module also supports caching previously initialized analysis engines and executing both predefined UIMA pipelines (one based on the external OpenCalais and AlchemyAPI services is already implemented) and custom ones. The module provides utility methods for retrieving UIMA annotations from the CAS model and decorating an existing graph node with the information extracted by UIMA.
- uima.metadata-generator
this module contains an implementation of a Clerezza metadata generator, which generates metadata about data sent as a sequence of bytes: it detects the resource media type with Apache Tika and then extracts tags, concepts, language and other entities with the uima.utils UIMA pipeline based on external services.
- uima.casconsumer
a CAS Consumer in UIMA is an analysis component responsible for consuming, in some way, the annotations and feature structures contained in a CAS (or the CAS itself). The ClerezzaCASConsumer contained in this module can map the information contained in a CAS to an RDF graph and optionally store it in a triple store. The mapping strategy can be configured and extended; the current implementations include a default mapping based on the basic uima.utils mapping strategy and a mapping based on the Annotation Ontology.
- uima.concept-tagging
this module provides a UIMA-enabled version of the base Clerezza concept tagger, which can automatically annotate a node with concept tags. It also provides a service that automatically enhances an external resource (given its URI), writes it to the triple store with the Clerezza CASConsumer and returns an RDF version of the annotated resource.
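The extension classloader and the analysis engine cache described for uima.utils can be sketched in plain Java. This is only an illustration under stated assumptions: DelegatingClassLoader, EngineCache and their methods are hypothetical names, not the module's actual classes, and the cached "engine" is a plain string standing in for a real UIMA analysis engine.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Function;

/** Hypothetical illustration only; not the actual uima.utils classes. */
public class ExtensionLoaderSketch {

    /** Class loader that asks each registered delegate in turn. */
    static final class DelegatingClassLoader extends ClassLoader {
        private final List<ClassLoader> delegates = new CopyOnWriteArrayList<>();

        DelegatingClassLoader(ClassLoader parent) {
            super(parent);
        }

        /** A bundle would call this (for example from its activator) on start. */
        void register(ClassLoader loader) {
            delegates.add(loader);
        }

        /** And this when the bundle stops. */
        void unregister(ClassLoader loader) {
            delegates.remove(loader);
        }

        /** Called by loadClass only after the parent loader has failed. */
        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            for (ClassLoader delegate : delegates) {
                try {
                    return delegate.loadClass(name);
                } catch (ClassNotFoundException ignored) {
                    // Not visible to this delegate; try the next one.
                }
            }
            throw new ClassNotFoundException(name);
        }

        /** Convenience check that hides the checked exception. */
        boolean canLoad(String name) {
            try {
                loadClass(name);
                return true;
            } catch (ClassNotFoundException e) {
                return false;
            }
        }
    }

    /** Cache of initialized engines, keyed by an assumed descriptor path. */
    static final class EngineCache<E> {
        private final Map<String, E> engines = new ConcurrentHashMap<>();
        private final Function<String, E> factory;

        EngineCache(Function<String, E> factory) {
            this.factory = factory;
        }

        /** Builds the engine on first request and reuses it afterwards. */
        E get(String descriptorPath) {
            return engines.computeIfAbsent(descriptorPath, factory);
        }
    }

    public static void main(String[] args) {
        DelegatingClassLoader loader = new DelegatingClassLoader(null);
        loader.register(ExtensionLoaderSketch.class.getClassLoader());
        System.out.println(loader.canLoad(ExtensionLoaderSketch.class.getName()));

        EngineCache<String> cache = new EngineCache<>(path -> "engine for " + path);
        // Both calls return the same cached instance for the same key.
        System.out.println(cache.get("a.xml") == cache.get("a.xml"));
    }
}
```

Keeping the delegates in a list that bundles add to and remove from mirrors the register-on-start idea in the text; the cache avoids repeating the comparatively expensive engine initialization for a descriptor that has already been seen.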
Getting started