Saturday 7 May 2005

Understanding Semantic Web (Part 2)

As discussed in the previous part, the main challenge today lies in integration. From databases to systems to services, companies spend billions to make systems work together. Someone might ask: why not have a standard in the first place? It takes time to come up with a standard, and even standards alone don't help, because things move too quickly in the world of the PC and the Internet.

Integration is not just about connecting existing systems; every integration also breaks down a previous barrier and helps build new, innovative services.

Let's look, at a very basic level, at the kinds of problems that arise in integration.

Database Integration
--------------------
Let's assume company A has a database of products kept in a relational database:
Product Table A (Product ID, Name, Price (USD), Category)

Now there's another company B, which also has a database of products with a similar structure (also in a relational database):
Product Table B (Product ID, Name, Price (Euro), Category)

The key difference between Table A and Table B is the semantics of the Price column: the price in Table A is in USD and the price in Table B is in euros. What's the best way to integrate two systems whose table schemas are similar or different and, more importantly, can it be "automated"? The databases have no notion of what a price or a currency is, which makes it hard for them to interpret the values.
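To make the Price problem concrete, here is a minimal Python sketch of merging the two product tables. The sample rows, the `Product` class, the helper names and the exchange rate are all made up for illustration; the point is only that the currency has to be supplied from outside, because neither table records it.

```python
from dataclasses import dataclass

# Hypothetical exchange rate, for illustration only; a real system would look it up.
EUR_TO_USD = 1.25


@dataclass
class Product:
    product_id: int
    name: str
    price: float
    currency: str  # the unit that the bare Price column leaves implicit
    category: str


# Made-up sample rows as they might come out of each company's table.
table_a = [(1, "Keyboard", 20.0, "Hardware")]  # Price is in USD
table_b = [(7, "Mouse", 8.0, "Hardware")]      # Price is in EUR


def load(rows, currency):
    """Attach the currency that the schema itself does not record."""
    return [Product(pid, name, price, currency, cat) for pid, name, price, cat in rows]


def to_usd(product):
    """Normalize a product's price to USD using the assumed rate."""
    if product.currency == "USD":
        return product
    if product.currency == "EUR":
        return Product(product.product_id, product.name,
                       round(product.price * EUR_TO_USD, 2), "USD", product.category)
    raise ValueError(f"unknown currency: {product.currency}")


# A human had to tell load() which currency each table uses.
merged = [to_usd(p) for p in load(table_a, "USD") + load(table_b, "EUR")]
```

Nothing in the two schemas says which currency applies, so the integration only works because someone passed "USD" and "EUR" by hand. That manual step is exactly what is hard to automate.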

There are two main kinds of differences in data integration: syntactic and semantic. Syntactic differences (such as the same field having different names) are easier to resolve, but semantic differences involving structural changes are much harder (for example, a single field called Address in one table corresponds to multiple fields spread across different tables in another).
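The contrast can be sketched in a few lines of Python. The field names (`prod_name`, `cost`, `street`, `city`, `postal_code`) and the address layout are assumptions chosen for the example, not taken from any real schema.

```python
# Syntactic difference: the same fact stored under different column names.
# A lookup table is enough to resolve it.
RENAMES = {"prod_name": "name", "cost": "price"}


def rename_fields(record, renames):
    """Return a copy of record with keys renamed according to renames."""
    return {renames.get(key, key): value for key, value in record.items()}


# Structural difference: one source keeps a single Address string,
# the other spreads the same information over several fields.
def join_address(record):
    """Going from several fields to one string is easy."""
    return ", ".join([record["street"], record["city"], record["postal_code"]])


def split_address(address):
    """Going back requires guessing the layout.

    This assumes exactly "street, city, postal_code"; any address that
    deviates from that makes the unpacking below raise ValueError.
    """
    street, city, postal_code = [part.strip() for part in address.split(",")]
    return {"street": street, "city": city, "postal_code": postal_code}
```

The rename is a pure lookup, while splitting an address depends on knowing what the parts mean, which is why the second kind of difference resists automation.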

Then there are cases where one has to integrate a relational database (RDBMS) with an object-oriented database (ODBMS) or some other proprietary database format. This can be automated to a large extent.

Data Integration on Internet
----------------------------
With the advent of the Internet, sharing data became a ubiquitous need. Plain HTML was nice, and people could make excellent home pages, but it made it difficult for machines to share and process data. XML arrived as an excellent tool, providing a standard for sharing structured data.

But the claim that having data in XML solves everything is a myth. Structurally, XML sits at the same level as an RDBMS, an ODBMS or any other storage layer, with certain differences between each of these systems.

RDBMSs and ODBMSs are excellent for storing data, providing transactional capabilities and efficient query retrieval, but they are bad when one wants to share data. One needs an explicit import/export process depending on application needs.

XML, on the other hand, provides an excellent format for sharing structured data. Some products, such as Tamino, provide an XML database.

But otherwise, whether it's an XML database, an RDBMS or an ODBMS, essentially one has a particular schema (table structure in an RDBMS, XML Schema in XML) and then populates data according to that schema.

Having data in XML solves syntactic problems to some extent, but semantically it suffers from the same problems as any other database. If two companies (say, two publishing houses) want to share data, each will have its own XML format for its data. Integrating these two XML formats is similar to the RDBMS integration problem discussed above.
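As a rough illustration, here are two invented XML formats for the same book, one per publishing house, read with Python's standard `xml.etree.ElementTree`. The element and attribute names and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical formats describing the same (made-up) book.
HOUSE_A = "<book><title>Example Title</title><author>A. Writer</author></book>"
HOUSE_B = '<publication name="Example Title"><writer>A. Writer</writer></publication>'


def from_house_a(xml_text):
    """House A stores title and author as child elements."""
    root = ET.fromstring(xml_text)
    return {"title": root.findtext("title"), "author": root.findtext("author")}


def from_house_b(xml_text):
    """House B stores the title as an attribute and calls the author 'writer'."""
    root = ET.fromstring(xml_text)
    return {"title": root.get("name"), "author": root.findtext("writer")}
```

Both documents are well-formed XML and parse without trouble, yet a separate hand-written function is still needed for each format. The parser knows the syntax; it does not know that `writer` and `author` mean the same thing.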

So if there are N different formats, true interoperability needs a mapping between every pair of formats, on the order of N x N mappings, which doesn't scale. The common approach is to define a common schema and map each of the N formats to it. This reduces the number of mappings to the order of N. Any new format just has to provide a mapping to the common schema.
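The scaling argument is easy to check with a little arithmetic. The sketch below counts one-way translators, which is one possible way of counting: every format needs a translator to every other format in the direct case, and one translator to and one from the common schema in the hub case. The function names are made up for the example.

```python
def direct_translators(n):
    """One-way translators needed when every format maps straight to every other."""
    return n * (n - 1)


def hub_translators(n):
    """One-way translators needed when each format maps to and from a common schema."""
    return 2 * n


def cost_of_adding_one(n):
    """Extra translators needed when one more format joins n existing ones."""
    return {
        "direct": direct_translators(n + 1) - direct_translators(n),
        "hub": hub_translators(n + 1) - hub_translators(n),
    }
```

With 10 formats the direct approach needs 90 translators against 20 through a common schema, and adding an eleventh format costs 20 new translators in the direct case but only 2 with the hub.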

That's where standards come in, providing a common schema format. There are standards for things big and small: vCard for address books, Dublin Core metadata properties for content, RosettaNet (supply chain transactions), ebXML (electronic business processes).

If everyone adheres to a common standard, it looks like the integration problem is solved! Bingo! If every address book tool follows vCard, "I" can move my data very easily from my mobile phone to my PDA to any tool on the PC or the Internet. But the world is not that simple: needs and requirements change very frequently. More importantly, my address book is no longer a stand-alone tool; I want it integrated with my email tools, my calendar tools, or even my word processor or Excel.

So how do I interoperate data between multiple domains? Should I make one big standard that covers everything? A common standard provides a standard way to import or export data, but there is no standard way to access data. I can import/export my address book, but if I want my agent to do it automatically for me, it's not possible. This is where the service-oriented world is playing its part. In the next part, we will look into semantic web technologies (ontologies, RDF) and the role they play in solving the integration problem.
