My Blog: May 2005

Friday 27 May 2005

Blogging As a Career

Gawker is trying to set up a model for advertising-supported weblogs. Gawker Media blogs include the popular gossip sites Gawker, Wonkette and Defamer, gadget blogs like Gizmodo, and others. Bloggers are paid $2500 a month and the blog aims to earn $75K per annum. And yes, you can be an intern too. IWantMedia has a story published on Gawker.

Mobile Search

Got this interesting piece from Mobile Technology Weblog about Mobile Search getting hotter.

Firstly, 30% of searches are currently to look for mobile content (ringtones etc). Since about 2/3 of mobile content is currently sold via operator portals, this is a clear and present danger for operator revenues. In other words, while they may make money from the advertisers paying for their ads to be presented to users, many of these ads will be for competitors of the operators.

Complete article here.

Wednesday 25 May 2005

Understanding Semantic Web (Part -3)

In the previous two posts (part1 and part2) about the Semantic Web, I mainly talked about the problems and the approaches people follow towards making data and services more seamless. From databases to service-oriented middleware, the issues of integration and the challenges that lie ahead are huge.

What is the Semantic Web going to change? Will it make the whole system work automatically? Rather than predicting the answer, let's walk through the approaches that the Semantic Web community is trying to apply.

At the core of Semantic Web techniques, there exists a data model (like the Relational DB Model, the Object Oriented Model, or XML) called RDF (Resource Description Framework). RDF is commonly written in XML (the RDF/XML syntax), but semantically there are huge differences between the two. RDF is mainly a graph-oriented model and XML is a hierarchical model. RDF enables one to make statements about resources. So if one wants to say "John has age 35", the RDF data model enables one to represent this statement.

Consider the above statement as a tuple (John, hasAge, 35) where John is the subject, hasAge is the predicate (property) and the value 35 is the object (a literal).

RDF enables one to make statements and also statements about statements. Extending the above example:
John owns Ferrari. The color of Ferrari is red.

Equivalent tuples:
(John, owns, Ferrari) (Ferrari, hasColor, Red)

So RDF allows one to represent statements in the form of triples (subject, predicate, object) and then aggregate such triples.
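As a rough sketch (in Python, which this post does not otherwise use), the statements above can be held as plain (subject, predicate, object) tuples. The `Triple` type and the `objects_of` helper are names made up for illustration, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The three statements from the text, one triple each.
triples = [
    Triple("John", "hasAge", "35"),
    Triple("John", "owns", "Ferrari"),
    Triple("Ferrari", "hasColor", "Red"),
]


def objects_of(store, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]


# Chain two statements: what colour is the thing John owns?
owned = objects_of(triples, "John", "owns")
colors = [c for thing in owned for c in objects_of(triples, thing, "hasColor")]
print(colors)
```

Because every statement has the same three-part shape, adding a new fact is just appending one more tuple; no schema change is needed.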

One thing that makes RDF interesting is the use of URIs. URIs are Uniform Resource Identifiers, via which one can identify a resource uniquely; a URL, for example, is a particular type of URI. So how do URIs help?

Let's consider a scenario where I represent certain information about myself on my website in RDF. (Sunil, hasAge, 26) (Sunil, livesIn, New Delhi)
Let's say I identify Sunil using the URI http://enventure.blogspot.com/person#Sunil
This URI is unique for Sunil and anyone using the above URI anywhere refers to the same resource.
I identify the property hasAge by URI http://enventure.blogspot.com/person#hasAge
and livesIn by http://enventure.blogspot.com/person#livesIn

So triples can be represented via
(http://enventure.blogspot.com/person#Sunil , http://enventure.blogspot.com/person#hasAge, 26)

(http://enventure.blogspot.com/person#Sunil, http://enventure.blogspot.com/person#livesIn, New Delhi)

And now let's say my friend Lomesh wants to talk about himself on his website.

(http://lomesh.blogspot.com/person#Lomesh , http://lomesh.blogspot.com/person#hasAge, 26)
(http://lomesh.blogspot.com/person#Lomesh, http://lomesh.blogspot.com/person#livesIn, Sacramento)

Now say Lomesh wants to refer to Sunil and state that Lomesh is a friend of Sunil, or wants to add some more information about Sunil. He just adds the equivalent triples to his website.

(http://lomesh.blogspot.com/person#Lomesh , http://lomesh.blogspot.com/person#friendOf, http://enventure.blogspot.com/person#Sunil)

RDF is flexible enough to allow anyone to add any kind of triples identified by URIs. Any RDF aggregator can aggregate all such triples together and then do reasoning on top of them. In Object Oriented Models, every class has two main components: the data or properties it identifies (e.g. name, age for class Person) and the methods (getName(), getAge()) which operate on the data. In the RDF model, data lies outside the class definition, enabling anyone to add any data or property at will.

So if a future search engine aggregates the triples that exist on both Lomesh's and Sunil's websites, it will be able to integrate the enriched information (the URIs make the corresponding matching possible) and present a more coherent picture.
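A minimal sketch of such an aggregator, assuming each site simply exposes a set of triples. The `aggregate` and `describe` functions are hypothetical helpers written for this example; the URIs are the ones used in the triples above.

```python
S_NS = "http://enventure.blogspot.com/person#"
L_NS = "http://lomesh.blogspot.com/person#"
SUNIL = S_NS + "Sunil"
LOMESH = L_NS + "Lomesh"

# Triples published on each website.
sunil_site = {
    (SUNIL, S_NS + "hasAge", "26"),
    (SUNIL, S_NS + "livesIn", "New Delhi"),
}
lomesh_site = {
    (LOMESH, L_NS + "hasAge", "26"),
    (LOMESH, L_NS + "livesIn", "Sacramento"),
    (LOMESH, L_NS + "friendOf", SUNIL),
}


def aggregate(*sites):
    """Merge the triples of several sites into one set."""
    merged = set()
    for site in sites:
        merged |= site
    return merged


def describe(store, resource):
    """All triples that mention the resource as subject or as object."""
    return sorted(t for t in store if t[0] == resource or t[2] == resource)


merged = aggregate(sunil_site, lomesh_site)
about_sunil = describe(merged, SUNIL)
for triple in about_sunil:
    print(triple)
```

The matching works only because both sites use exactly the same URI string for Sunil; a different spelling or case would be treated as a different resource.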

But for this to happen, one key thing that needs to be solved is standardizing vocabularies! If everyone defines his own vocabulary of hasAge, livesIn, friendOf under his own independent URIs, search engines will still be confused, as they will not be able to interpret them, and we would still be in the syntactical world.

Vocabularies and ontologies bring semantics to the RDF world. In essence any kind of structured data, from a dictionary to a thesaurus to a set of categories, can be considered an ontology, the variation being in the richness. Ontologies differ from vocabularies in two ways: they try to map human or world knowledge into a structure, and they represent shared knowledge, so they have to be AGREED UPON. Defining your own ontology in isolation doesn't actually give you an ontology; it's just another vocabulary.

So let's say we want to define a person via an ontology. We can define an ontology such as:
(Person hasName String)
Person is a class or a concept. It has a property "hasName" whose value is a String.

Similarly, other properties can be attached or new concepts can be defined.
(Person hasAge Numeric)
(Person livesIn City)
(City isPartOf State)
(State isPartOf Country)
(Country isPartOf World)
(India isInstanceOf Country)

The above relationships when combined together form a graph like structure where entities(subjects or objects) are related by certain properties. Ontologies can vary from being very simple to being very complex. There are ontologies for cultural domain, education, beer and wine, Persons, Address Books etc. Then there are ontologies which link multiple ontologies or act as upper level ontologies. A good resource for ontologies is http://www.schemaweb.info
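To make the graph-like structure concrete, here is a small sketch that stores the relationships listed above as triples and walks one property from node to node. The `follow` helper is an illustrative name, the predicate spelling is normalised to `isPartOf`, and the code assumes each subject has at most one outgoing edge per predicate.

```python
ontology = [
    ("Person", "hasName", "String"),
    ("Person", "hasAge", "Numeric"),
    ("Person", "livesIn", "City"),
    ("City", "isPartOf", "State"),
    ("State", "isPartOf", "Country"),
    ("Country", "isPartOf", "World"),
    ("India", "isInstanceOf", "Country"),
]


def follow(triples, start, predicate):
    """Follow one predicate from node to node until no further edge exists."""
    # Assumption: at most one edge per subject for this predicate.
    edges = {s: o for s, p, o in triples if p == predicate}
    path = [start]
    # The second condition guards against looping forever on a cycle.
    while path[-1] in edges and edges[path[-1]] not in path:
        path.append(edges[path[-1]])
    return path


chain = follow(ontology, "City", "isPartOf")
print(" -> ".join(chain))
```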

Another data schema model provided by the Semantic Web community is RDFS (RDF-Schema), along with newer ones like OWL (Web Ontology Language, an extension of RDF-S). RDF-S and OWL provide constructs to build ontologies, e.g. constructs like instanceOf, subClassOf and subPropertyOf. These constructs enable one to reason about things.

Example:
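(City subClassOf Place)
(NewDelhi instanceOf City)
(India instanceOf Country)

From the first two constructs, one can deduce that NewDelhi is a Place. Deducing that NewDelhi is a part of India would additionally need an isPartOf relation between the two, since a city is not a kind of country.
Languages like OWL provide richer constructs, where one can, say, express cardinality constraints.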
e.g. One can represent the following two sentences using OWL.
Person owns a Car. Person can own 0 or more cars.
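The kind of reasoning that subclass and instance constructs allow can be sketched in a few lines. This is only an illustration under stated assumptions: the class names (Capital, City, Place) are hypothetical, the plain strings `subClassOf` and `instanceOf` stand in for what RDF Schema calls rdfs:subClassOf and rdf:type, and `infer_types` is a made-up helper, not a real reasoner.

```python
def infer_types(triples):
    """Return every (instance, class) pair implied by the subClassOf links."""
    # Direct superclass links: class -> set of parent classes.
    parents = {}
    for s, p, o in triples:
        if p == "subClassOf":
            parents.setdefault(s, set()).add(o)

    def ancestors(cls, seen=None):
        """Collect all superclasses of cls, direct and indirect."""
        seen = set() if seen is None else seen
        for parent in parents.get(cls, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    inferred = set()
    for s, p, o in triples:
        if p == "instanceOf":
            inferred.add((s, o))  # the stated type
            for anc in ancestors(o):
                inferred.add((s, anc))  # types inherited via subClassOf
    return inferred


facts = [
    ("Capital", "subClassOf", "City"),
    ("City", "subClassOf", "Place"),
    ("NewDelhi", "instanceOf", "Capital"),
]
types = infer_types(facts)
print(sorted(types))
```

Only one type was stated for NewDelhi; the other two follow from the class hierarchy, which is the sort of deduction a shared ontology makes possible.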

RDF-S and OWL, being based on RDF, use URIs to distinguish concepts and properties. So anyone can establish an ontology and anyone else can make extensions to it.

Taking the previous example of Sunil and Lomesh, there now exist two things needed to make data semantically rich:
An RDF-S based ontology for defining people and friends.
And particular instances Sunil and Lomesh who use that ontology to make particular statements. Any third person can use or refer to the ontology or triples written by Lomesh or Sunil.

An intelligent search engine should be able to aggregate all such triples, combine ontologies, combine the instances and then do intelligent reasoning based on that.

RDF-S, OWL and RDF all come from an Artificial Intelligence background. Expert systems have been around that did most of what RDF or OWL does, and much more. The key thing that has changed between the past and now is the WEB. The previous systems were closed systems intended to solve a particular AI problem, but the current ones are being designed keeping in view the ubiquitous nature of the web.

But the Semantic Web community will still need to overcome the problems that previous AI systems faced. Most of them revolve around ontology building, ontology maintenance, ontology merging, ontology mapping, and whether we need complex ontologies at all. Most ontologies have grown to be complex, sometimes far beyond an average individual's comprehension. What one needs is the application of software engineering practices to semantic web technologies. In the next sections, I will go through some particular challenges and use case scenarios where Semantic Web techniques are being applied.

Monday 23 May 2005

Journey Back From Home

For the last 3 days I was at home, relaxing. Days at home give you a break from the fast life of Delhi. When I am in Delhi, it looks like hundreds of things are running here and there and your life passes just trying to keep pace with the world around you. But at home it's completely the opposite. There's just peace and calmness all around you. Maybe it's because I belong to a small town where people care more, there is less pollution, and there's a sense of belonging which one doesn't find in bigger cities.

Coming back from home on the train, seeing the world around me, made me think again - what will be the face of India 20 or 30 or even 50 years from now? There is a huge disparity between Delhi and the smaller towns and villages. You just need to watch the real India from that small open window by your seat and you start realizing the reality. People still living in the open, with no water, no electricity, and no sanitation! Though I saw these things much closer in my native village around 15 years back, there are many places where life is still the same or maybe worse. You just need one train journey in a general compartment to realize where we are.

We the technologists, the software engineers, are so overwhelmed by IT that we don't see the other side of the coin. Having seen so many things in Europe and in India, will we ever see a modern India (where people at least get all the basic necessities of life) in our lifetime? I once asked my uncles and aunts the same question, and their answer was a strong NO. I hope this NO changes into a Yes, even though I myself am a bit reluctant in making that statement. But I still hope ...

Sunday 15 May 2005

Change the World

Tim Draper from DFJ gave a presentation at Stanford as part of the ETL series. The video is here.
The most interesting bit is the song for Entrepreneurs (Risk Master) :-).

The best part of an entrepreneur is to go against the odds and change the world. So just follow what you believe in and just do it.

Monday 9 May 2005

Structured Blogging and Prospective Search

Via Bob Wyman from PubSub:

With Structured Blogging, we'll be able to post structured items in any of millions of blogs or web sites and have those items recognized, indexed, and searched on any number of search sites -- just like HTML pages are today. No longer will we need to rely on going to a small number of centralized, walled-garden, closed sites like MeetUp, eBay, Monster, or EVDB to publish or search for the kind of information that requires structure. Common search engines like Google, Yahoo! and PubSub will be able to usefully index this data. In the future, as search engines come to better support structured data, we will all benefit as the Gray Web grows smaller and the visible web grows larger.

Saturday 7 May 2005

Understanding Semantic Web (Part -2)

As discussed in the previous part, the main challenge today lies in integration: from databases to systems to services, companies spend billions to make systems work together. Someone might ask, why not have a standard in the first place? It takes time to come up with a standard, and even standards alone don't help: things move too fast in the world of the PC and the Internet.

Integration is not just about integrating existing systems; every integration kind of breaks a previous barrier and helps build new innovative services.

Let's look, at a very basic level, at the kinds of problems that arise in integration.

Database Integration
--------------------
Let's assume company A has a database of Products kept in a Relational Database
Product Table A (Product ID, Name, Price (USD), Category)

And now there's another company B, which also has a database of Products with a similar structure (also within a relational database).
Product Table B (Product ID, Name,Price (Euro), Category)

The key difference between Table A and Table B is the semantics of the Price column. The Price in table A is in USD and the Price in table B is in Euro. What's the best way to integrate two systems which have different or similar table schemas and, more importantly, can it be "automated"? The databases don't have any notion of price and currency, which makes it harder for them to interpret the data.
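A small sketch of what integrating the two product tables involves once the currency is made explicit. Everything here is an assumption for illustration: the sample rows, the `currency_of` metadata, and especially the exchange rate, which is a placeholder and not a real quote.

```python
EUR_TO_USD = 1.25  # hypothetical rate, for illustration only

# Rows shaped like Product Table A and Product Table B.
table_a = [{"product_id": "A1", "name": "Lamp", "price": 20.0, "category": "Home"}]
table_b = [{"product_id": "B7", "name": "Desk", "price": 100.0, "category": "Office"}]

# The knowledge the tables themselves do not carry: which currency each uses.
currency_of = {"A": "USD", "B": "EUR"}
to_usd = {"USD": 1.0, "EUR": EUR_TO_USD}


def integrate(tables):
    """Merge product rows from several sources, adding a price in USD."""
    merged = []
    for source, rows in tables.items():
        rate = to_usd[currency_of[source]]
        for row in rows:
            merged.append(
                {**row, "source": source, "price_usd": round(row["price"] * rate, 2)}
            )
    return merged


merged = integrate({"A": table_a, "B": table_b})
for row in merged:
    print(row["product_id"], row["name"], row["price_usd"])
```

The column names match, so a purely syntactic merge would succeed; the result would just be wrong. The `currency_of` table is the piece of meaning that someone has to supply by hand.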

There are mainly two kinds of differences in data integration: syntactic and semantic. Syntactic differences (involving syntax changes such as a difference in names) are easier to resolve, but semantic differences involving structural changes are much harder to resolve (for example, a particular field called Address in one table is related to multiple fields in different tables).

Then there are cases where one has to integrate a relational database (RDBMS) with an object-oriented database (ODBMS) or any other proprietary database format. This can be automated to a large extent.

Data Integration on Internet
----------------------------
With the advent of the Internet, sharing data became a more ubiquitous need. Plain HTML was nice, and people could make excellent home pages, but it made it difficult for machines to share and process data. XML came as an excellent tool, providing a standard for sharing structured data.

But saying that having data in XML solves everything is a myth. Structurally, XML sits at the same level as an RDBMS, an ODBMS or any other storage layer, with certain differences between each of these systems.

RDBMS and ODBMS provide an excellent form of data storage, with transactional capabilities and efficient query retrieval, but they are bad when one wants to share data. One needs an explicit import/export process depending on application needs.

XML, on the other hand, provides an excellent format for sharing structured data. Certain products like Tamino provide an XML database.

But otherwise, whether it's an XML DB, RDBMS or ODBMS, essentially one has a particular schema (table structure in an RDBMS, XML Schema in XML) and then one populates data according to the schema.

Having data in XML solves syntactic problems to some extent, but semantically it suffers from the same problems that any other database suffers from. If two companies (say, two publishing houses) want to share data, each of them will have its own XML format for its data. Integrating these two XML formats is similar to the integration problem of an RDBMS discussed above.

So if there are N different formats, true interoperability needs a mapping between every pair of formats, on the order of N×N mappings, which doesn't scale. The common approach is to have a common schema and map each of the N formats to it. This reduces the number of mappings to N. Any new format just has to provide a mapping to the common schema.
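The scaling argument can be checked with a few lines of arithmetic, followed by a toy version of the common-schema approach. The counts assume one directed translator per ordered pair of formats and one mapping per format into the common schema (counting the reverse direction too would double the second number, which is still linear). The publisher formats and field names are hypothetical.

```python
def pairwise_mappings(n):
    """Directed translators needed when every format maps to every other."""
    return n * (n - 1)


def hub_mappings(n):
    """Mappings needed when each format maps only to one common schema."""
    return n


for n in (3, 10, 50):
    print(n, pairwise_mappings(n), hub_mappings(n))

# Toy common-schema mapping: each format only declares how its own field
# names translate into the shared ones.
TO_COMMON = {
    "publisher_a": {"title": "title", "writer": "author"},
    "publisher_b": {"name": "title", "by": "author"},
}


def to_common(fmt, record):
    """Rename a record's fields from a source format into the common schema."""
    return {TO_COMMON[fmt][key]: value for key, value in record.items()}


book = to_common("publisher_b", {"name": "Weaving the Web", "by": "Tim Berners-Lee"})
print(book)
```

Adding a third publisher means adding one more entry to `TO_COMMON`, not one translator per existing format.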

That's where standards come in, providing a common schema format. There are standards for things big and small: vCard for address books, Dublin Core metadata properties for content, RosettaNet (supply chain transactions), ebXML (electronic business processes).

If everyone adheres to a common standard, it looks like the integration problem gets solved! Bingo! If every address book tool follows vCard, "I" can move my data very easily from my mobile phone to my PDA to any tool existing on the PC or the Internet. But the world is not that simple; needs and requirements change very frequently. And more importantly, my address book is no longer a stand-alone tool: I want my address books to be integrated with my email tools, with calendar tools or even with my word processor or Excel.

So how do I interoperate data between multiple domains? Should I make one big standard which covers everything? Having a common standard gives a standard way to import or export data, but there is no standard way to access data. I can import/export my address book, but if I want my agent to do it automatically for me, it's not possible. This is where the service-oriented world is playing its part. In the next part, we will look into semantic web technologies (ontologies, RDF) and the role they are playing in solving the integration problem.

Wednesday 4 May 2005

RSS Filters

This morning, while reading this blog from Piers Young, I started thinking about a possible solution to the information overload problem in RSS. People subscribe to a lot of RSS feeds, but it's very hard to read each and every bit of information.

RSS feeds, especially blogs written by friends or other individuals, are just as personal as emails. Even in the case of public blogs, one might need to skim certain entries, but it's hard to keep track of 100 or more feeds within an aggregator. One possible approach is to develop email-like spam filters for RSS tools. In case there is excessive information from one particular feed, it can be filtered very easily.

So within my RSS Tool, I should be able to specify:
- Remove similar entries that have been read. (Dump them separately, so I can see if I want)
- Filter entries via keywords (Should be able to read unfiltered ones, if I want)
- Views based on filters - Have a view-like capability, where a tag view is created from entries across all subscribed RSS feeds.
- Person based view - All discussions and comments given by a person appear in a person view.
- View based Clustering of RSS feeds.
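A few of the items above can be sketched quite directly. This is only an outline of the idea, not how any existing RSS tool works: the `Entry` fields, the function names and the sample entries are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    feed: str
    author: str
    title: str
    tags: list = field(default_factory=list)


def split_by_keywords(entries, blocked):
    """Return (kept, filtered); filtered entries are set aside, not deleted."""
    kept, filtered = [], []
    for entry in entries:
        hit = any(word.lower() in entry.title.lower() for word in blocked)
        (filtered if hit else kept).append(entry)
    return kept, filtered


def tag_view(entries, tag):
    """All entries carrying a tag, regardless of which feed they came from."""
    return [e for e in entries if tag in e.tags]


def person_view(entries, author):
    """Everything written by one person across all subscribed feeds."""
    return [e for e in entries if e.author == author]


entries = [
    Entry("feed-a", "alice", "RDF and ontologies", ["semweb"]),
    Entry("feed-b", "bob", "Ringtone deals today", ["mobile"]),
    Entry("feed-b", "alice", "Mobile search heats up", ["mobile", "search"]),
]

kept, filtered = split_by_keywords(entries, ["ringtone"])
print([e.title for e in kept])
print([e.title for e in filtered])
print([e.title for e in tag_view(entries, "mobile")])
```

Keeping the filtered entries in their own list, rather than dropping them, matches the wish above to still be able to read them on demand.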

This might be one possible approach to solving the information overload within RSS, at least in the short run.