Tuesday, December 20, 2011

Data Mapping with Inference and Feedback

We worked with thousands of companies through most of the 1990s and the early Web 2.0 era. Nearly every medium to large enterprise has struggled with data integration projects. Every new acquisition, system or IT project creates a new integration project. To make matters worse, there are no standard crosswalks for data mapping. The problem is not only epidemic but also increasingly neglected by many enterprises. David Luckham hinted at “IT Blindness”, where a company makes incredible blunders that are compounded by false beliefs, often generated by a lack of real data or the inability to process events (both simple and complex). David has developed a set of patterns for solving some of these issues (Complex Event Processing, or CEP), yet the events themselves must still be minable for data that can then be integrated.

The Problem:

Data mapping has historically been a rather time-consuming practice, often done manually. There are many issues with data mapping, some of which depend on the context in which instance data appears. To illustrate this point, let us assume that we could create a single data dictionary of all the terms used in business. This approach has been tried many times with various EDI and XML dialects. Defining a simple data element, such as one that denotes the first name of a human being, should be easy, correct? The definition itself is not the issue; the difficulty is mapping the element automatically when it is encountered. The logic of context often makes this hard. Take this data element for example:

Element Name: FirstNameOfPerson 

Type: String64

Description: a string value representing the legal first name of a human being.

We could easily serialize this into an XML element as <FirstNameOfPerson>Duane</FirstNameOfPerson>. Now account for the fact that we must map this data format into a second format that has an element and semantics as follows:

Element Name: PersonFirstName 

Type: String

Description: a string value representing the legal first name of a human being.

In a vacuum, this mapping is pretty straightforward. The challenge comes when the aspect of “context” is added. To illustrate this issue, consider the following data structures:




While both use the same data element for the first name of a person, the semantics (or pragmatics rather) are slightly different based on the hierarchy and context. If both of these appear on the input side, they cannot be mapped to any instance of the PersonFirstName (the second example above) without contemplating the special nature each context brings. The meaning is the first name of a person but the two are not equal. One is the first name of the buyer party and the other is the first name of the seller party. Not immediately apparent is that the instance data set is now also bound to a process (procurement in this case).
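To make the buyer and seller distinction concrete, here is a minimal sketch in Python. The nested dictionary is a made-up stand-in for a procurement document, and the names PurchaseOrder, BuyerParty, SellerParty, Order, Buyer and Seller, along with the value "Alice", are assumptions for illustration only. The point it shows is that the mapping rules are keyed by the full path to an element rather than by the bare element name.

```python
# Hypothetical source record: the same element name appears under two parents.
source = {
    "PurchaseOrder": {
        "BuyerParty": {"FirstNameOfPerson": "Duane"},
        "SellerParty": {"FirstNameOfPerson": "Alice"},
    }
}

# Rules keyed by the full source path, not by the bare element name.
# The target paths are illustrative assumptions.
CONTEXT_RULES = {
    ("PurchaseOrder", "BuyerParty", "FirstNameOfPerson"):
        ("Order", "Buyer", "PersonFirstName"),
    ("PurchaseOrder", "SellerParty", "FirstNameOfPerson"):
        ("Order", "Seller", "PersonFirstName"),
}


def flatten(node, path=()):
    """Yield (path, value) pairs for every leaf in a nested dict."""
    for key, value in node.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (key,))
        else:
            yield path + (key,), value


def map_with_context(record, rules):
    """Place each leaf in the target according to its full source path."""
    target = {}
    unmapped = []  # paths with no rule, left for a human to resolve
    for path, value in flatten(record):
        if path not in rules:
            unmapped.append(path)
            continue
        *parents, leaf = rules[path]
        cursor = target
        for name in parents:
            cursor = cursor.setdefault(name, {})
        cursor[leaf] = value
    return target, unmapped


mapped, unmapped = map_with_context(source, CONTEXT_RULES)
print(mapped)
print(unmapped)
```

A rule keyed only on FirstNameOfPerson could not tell the two values apart, which is exactly the ambiguity described above. Anything without a rule lands in the unmapped list rather than being guessed at.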

The approach of manual data mapping has been around for a few decades. Automating this process is extremely difficult. A processor must be able to account for subtle differences in mapping rules based on a number of things. Even with the best schema and metadata support, exceptions and errors are likely to be encountered.

A Computational Intelligence (CI) approach caught my eye the other day. We at Technoracle have studied this problem for a number of years. The CI approach combines an inference engine with a graphical user interface. As input data is encountered, the user interface guides users by suggesting optimal mapping scenarios. Unlike more traditional approaches to auto-mapping that require a significant amount of preparatory work, the inference approach semi-automates some of the work.
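As a rough illustration of what "suggesting" a mapping can mean, here is a minimal sketch in Python. It is not how any particular product works. It simply ranks candidate target elements by string similarity using the standard library's difflib, which is a deliberately naive stand-in for a real inference engine. The function name suggest and the candidate names OrderDate and PostalCode are made up for the example.

```python
from difflib import SequenceMatcher


def suggest(source_element, target_elements, limit=3):
    """Return up to `limit` (score, target) pairs, best match first.

    The score is difflib's similarity ratio between the lowercased names,
    a crude proxy for the evidence a real inference engine would weigh.
    """
    scored = [
        (SequenceMatcher(None, source_element.lower(), t.lower()).ratio(), t)
        for t in target_elements
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Hypothetical target schema; only PersonFirstName comes from the text.
candidates = ["OrderDate", "PersonFirstName", "PostalCode"]
suggestions = suggest("FirstNameOfPerson", candidates)
for score, target in suggestions:
    print(f"{target}: {score:.2f}")
```

A user interface would present these ranked suggestions and let the user confirm or correct the top one, which is the semi-automated part of the workflow.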

Disclosure: We were contacted by an agent for Contivo to write about their system. No consideration was paid in exchange for this blog post. Technoracle reviews technology and does not speak for or make claims as a representative of the companies we highlight.

The approach espoused by one company in particular has caught our eye. Liaison’s Contivo (http://liaison.com/products/transform/contivo) builds reusable mappings by associating the metadata with a semantic "dictionary". The method uses an analytics model to parse incoming data, then it references that input against a dictionary that captures and stores mapping graphs. The dictionary is portable and can be leveraged by future transformation maps.

Liaison’s Contivo then establishes an integration vocabulary and thesaurus that may be fine-tuned by manual methods. Contivo then leverages the vocabulary and thesaurus to automate data transformation and reconciliation tasks that are traditionally implemented using manual mapping techniques.
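The general idea of a vocabulary and thesaurus can be sketched in a few lines of Python. To be clear, this is our own toy illustration and not Contivo's API or algorithm. The THESAURUS table, the MappingDictionary class and the helper names are all hypothetical. Element names are split into tokens, each token is normalized through the thesaurus, and two elements are treated as the same concept when their normalized token sets are equal.

```python
import re

# Hypothetical thesaurus: surface token -> canonical token (None = noise word).
THESAURUS = {
    "first": "given",
    "fname": "given",
    "given": "given",
    "of": None,
}


def tokens(element_name):
    """Split a CamelCase or snake_case element name into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", element_name)
    return [p.lower() for p in parts]


def concept(element_name):
    """Reduce an element name to an order-independent set of canonical tokens."""
    canon = set()
    for token in tokens(element_name):
        normalized = THESAURUS.get(token, token)  # unknown tokens pass through
        if normalized is not None:
            canon.add(normalized)
    return frozenset(canon)


class MappingDictionary:
    """A reusable store of target elements indexed by concept."""

    def __init__(self):
        self._by_concept = {}

    def register(self, target_element):
        self._by_concept[concept(target_element)] = target_element

    def lookup(self, source_element):
        """Return the registered target for the same concept, or None."""
        return self._by_concept.get(concept(source_element))


dictionary = MappingDictionary()
dictionary.register("PersonFirstName")
print(dictionary.lookup("FirstNameOfPerson"))
print(dictionary.lookup("LastNameOfPerson"))
```

Because the dictionary is keyed by concept rather than by any one schema's spelling, the same instance can be reused across future maps. Fine-tuning by hand amounts to editing the thesaurus entries.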

Figure – a snapshot of the Contivo Mapping

This approach was the basis for the long-term product roadmap at XML Global Technologies, a company co-founded in the dot-com era. Their plan was to use the mapping graphs built from their GoXML Transform product (now part of Xenos Group) and store these maps in a metadata Registry/Repository organized using a business ontology, so they could be accessed by an entire community of users rather than a single enterprise. This approach made a lot of sense back then and makes a lot of sense today. It also mitigates the issues of changing schemata and EDI vocabularies.

The problem has not gone away. There is a lot of great work being done by companies that can automate the mapping of integration data into known systems. Using a feedback loop such as Contivo's helps a system evolve over time and can facilitate a much more intelligent approach to solving this problem.
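A feedback loop of this kind can be sketched very simply. The Python below is a minimal illustration under our own assumptions, not a description of Contivo's internals. The FeedbackMapper class and the candidate name PersonName are hypothetical. Each time a user accepts or rejects a suggested mapping, the decision is recorded, and later suggestions are ranked by a smoothed acceptance rate.

```python
from collections import defaultdict


class FeedbackMapper:
    """Rank candidate mappings using recorded accept/reject decisions."""

    def __init__(self):
        # (source, target) -> [accepted count, rejected count]
        self._votes = defaultdict(lambda: [0, 0])

    def record(self, source, target, accepted):
        """Store one user decision about a suggested mapping."""
        self._votes[(source, target)][0 if accepted else 1] += 1

    def confidence(self, source, target):
        """Laplace-smoothed acceptance rate; 0.5 when there is no history."""
        accepted, rejected = self._votes.get((source, target), (0, 0))
        return (accepted + 1) / (accepted + rejected + 2)

    def rank(self, source, candidates):
        """Order candidates from most to least trusted for this source."""
        return sorted(
            candidates,
            key=lambda target: self.confidence(source, target),
            reverse=True,
        )


mapper = FeedbackMapper()
mapper.record("FirstNameOfPerson", "PersonFirstName", accepted=True)
mapper.record("FirstNameOfPerson", "PersonFirstName", accepted=True)
mapper.record("FirstNameOfPerson", "PersonName", accepted=False)

print(mapper.rank("FirstNameOfPerson", ["PersonName", "PersonFirstName"]))
```

The smoothing keeps a single early decision from dominating, so a mapping only rises to the top once users have confirmed it repeatedly.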

A long-term architecture Contivo might consider is a social approach to learning via a centralized repository of mapping knowledge. Each of the users' systems could continuously update and commit to a central knowledge base that uses the global trade dictionaries and various EDI and XML business dialects alongside a feedback circuit to learn the finer nuances of data translation.

We are left wondering whether a standard should be developed for declaring reusable mapping graphs and, if so, who should develop it. Many open data initiatives would benefit from this, as would those who use the open data.
