Words, Ideas, and Things: What Is The Semantic Web?

The Semantic Web is—or is hoped to be—the next revolution in the way the Internet is used, just as the World Wide Web was a revolution in the way the Internet was used. To get some perspective, we need to look back at history.

Before the Internet, computers existed as standalone machines, possibly with multiple monitor/keyboard terminals spread around a building. For long distance connections, wired circuits (think modems) had to be brought up and then maintained throughout a session. Local networks existed, but each network vendor had its own incompatible system. There wasn't a standard way of communicating across networks.

The Internet began as a U.S. Department of Defense project to connect research universities. By the end of 1969, networks at four universities were connected to each other. In 1983, the communication standard of this inter-network ("between"-network) was changed to the TCP/IP protocol suite, which is still the basis of Internet communication today. With an IP address (e.g. 203.0.113.100) and a port (e.g. 25), a computer in California can connect to the email program on a computer in Germany and leave a message for a user there. Or, slightly more user friendly, a kid growing up in rural North Dakota could use the telnet application to connect to a domain name (genesis.cs.chalmers.se) along with a port (3011) to play a text adventure game running on a university server in Sweden. (They've changed the address a bit since I was in high school.)

There was useful and fun stuff going on before the World Wide Web, but it was hard to discover new resources. It's hard to believe now, but it was common in those days to learn about Internet sites by reading about them in books. Paper books! Sure, there were Gopher servers with manually-maintained hierarchical categories of Internet resources, but these directories didn't keep up very well and the resources didn't usually link to each other.

The World Wide Web began as an internal project at CERN, the particle physics research center on the border of Switzerland and France. Researchers needed a better way to organize their information in a busy environment with lots of job turnover, so Tim Berners-Lee proposed a solution for CERN intentionally designed to work on a global scale as well. He wrote:

"a 'web' of notes with links (like references) between them is far more useful than a fixed hierarchical system. When describing a complex system, many people resort to diagrams with circles and arrows. Circles and arrows leave one free to describe the interrelationships between things in a way that tables, for example, do not. The system we need is like a diagram of circles and arrows, where circles and arrows can stand for anything." (source)

He was serious about the "anything" part, but we'll get back to that. As implemented, the circles came to represent documents and the arrows became references to other documents. Web pages linking to other web pages! The notion of document interlinking had been around for decades, but the World Wide Web turned the idea into practical, worldwide reality.

Linked documents sounds a little boring, but programmers have found ways to make web "documents" very interactive. Many other Internet applications have migrated into the web browser. Gopher was replaced by Yahoo (before Yahoo became a tabloid). Home users are more likely to use web mail than a standalone mail client. Twitter and Facebook have largely replaced IRC and other instant messaging clients. Web services (and web mail, unfortunately) are used to transfer files instead of FTP. It's a good thing that applications like Skype and BitTorrent exist, or people might forget there's a difference between the Internet and the World Wide Web!

What's next?

Many great things happened after we started linking documents; what if we try linking finer-grained pieces of data in usable ways? That's the idea behind the Semantic Web.

Think of it this way: the World Wide Web allowed organizations and individuals to put their relatively static documents "out there" for the world to see. But what about database generated content like library catalogs, or online store pricing, or current weather conditions? Web crawlers might be able to retrieve and usefully interpret some of this data, but that usually requires special per-site programming that breaks if the API or web formatting changes.

Getting Across Town, The Semantic Way

Here is an example of a web-published bus route:

http://lincoln.ne.gov/city/pworks/startran/routemap/weekday/route41.htm

An experienced bus rider can read this page and figure out how to plan a trip. A computer program would need help understanding how to parse all of this visually-structured data into precisely labeled information that it can reason about. Quick, what time does the last southbound bus leave "North Walmart" on Thursdays? It's not a trivial process to give that answer, even after we visually interpret the numbers as times in columns that correspond to bus stop locations on the map below. An even harder question might be: "I'm at arbitrary location X and want to reach location Y; what bus route gives me the shortest total walking distance?" In this case, a human on the right website might still have to manually look through all bus route pages, narrow it down to a couple of likely shortest routes, then spend more time comparing the tradeoff between walking farther to the first bus stop or walking farther from the last bus stop.

What would be really neat is a way for bus services and street map services to publish their data on the web in a computer-friendly form that allows third party web apps to combine all of this information and calculate answers to such questions. Even better: a universal format so mash-ups from unexpected combinations of data sources are easier to make. I'm thinking of a music app that checks your GPS position and your destination so it can create a playlist that ends within thirty seconds before your final stop. Or an emergency flight plan app that cross references ticket pricing options with weather predictions. Or a recipe web site that lets you mark missing ingredients and shows their pricing from the five closest stores. Or a personalized book recommendation site that filters by currently available titles in local public libraries. Or imagine searching the web for information on a brand-name drug, and the top results use the drug's generic name without mentioning the brand-name.
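To see why labeled data helps, here is a rough Python sketch of the "last southbound bus" question from earlier. The record layout and the `last_departure` function are my own invention, and every time in it is made up for illustration; none of it comes from the real route 41 timetable.

```python
from datetime import time

# Made-up departures in a labeled, machine-readable form. None of these
# times come from the real route 41 timetable.
departures = [
    {"route": "41", "stop": "North Walmart", "direction": "southbound",
     "service": "weekday", "departs": time(6, 15)},
    {"route": "41", "stop": "North Walmart", "direction": "southbound",
     "service": "weekday", "departs": time(18, 45)},
    {"route": "41", "stop": "North Walmart", "direction": "northbound",
     "service": "weekday", "departs": time(19, 10)},
    {"route": "41", "stop": "North Walmart", "direction": "southbound",
     "service": "saturday", "departs": time(17, 30)},
]

def last_departure(records, stop, direction, service):
    """Return the latest departure time matching the labels, or None."""
    times = [r["departs"] for r in records
             if r["stop"] == stop
             and r["direction"] == direction
             and r["service"] == service]
    return max(times, default=None)

# Thursday is a weekday, so ask for the weekday schedule.
answer = last_departure(departures, "North Walmart", "southbound", "weekday")
```

Once every number carries a label saying what it is, the question becomes a three-condition filter instead of an exercise in reading columns off a web page.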

Many of these things are possible without semantic web technology; they just require more work to set up and don't tend to be very reusable. For example, Google Transit can help with bus route planning, if a city has formatted its data specifically for this Google web app and joined the transit partner program. But what if a new business wants to reuse this information in a creative way? What if Google cancels the Transit service? It would be preferable to have an open standard for open data.

Linked Data

What's the plan, then? Open existing relational databases to the public? Not exactly. The World Wide Web Consortium is pushing for another database model that's a more natural fit for the web: a graph-style data model. From the Wikipedia article:

"Compared with relational databases, graph databases are often faster for associative data sets, and map more directly to the structure of object-oriented applications. They can scale more naturally to large data sets as they do not typically require expensive join operations. As they depend less on a rigid schema, they are more suitable to manage ad-hoc and changing data with evolving schemas. Conversely, relational databases are typically faster at performing the same operation on large numbers of data elements."

In other words, graph databases give up some speed on bulk operations over uniform data in exchange for flexibility (see also The Death of the Relational Database). For people who aren't math majors or computer programmers, "graph database" may sound like "graphical database." But what's meant is graph theory: a bunch of nodes and connections between nodes, usually visualized as circles and lines. A directed graph adds direction to those lines, so you get circles and arrows. Recall what Tim Berners-Lee wrote in his original proposal for the World Wide Web: "The system we need is like a diagram of circles and arrows, where circles and arrows can stand for anything." The World Wide Web is made of connections like this:

(http://en.wikipedia.org/wiki/Cat) --links to--> (http://www.catpert.com/)

Each URL (Uniform Resource Locator) is a circle and web links are the arrows. If you can imagine all URLs and all arrows between them as a gigantic diagram, you're visualizing the World Wide Web as one big directed graph.
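For programmers, that picture can be written down as a small data structure. This is only a sketch: the first link is the one from the example above, while the example.org page and its links are hypothetical, as are the function names.

```python
from collections import defaultdict

# Each pair is one arrow: (page the link is on, page it points to).
# Only the first link comes from the example above; the others are
# hypothetical and use the reserved example.org domain.
links = [
    ("http://en.wikipedia.org/wiki/Cat", "http://www.catpert.com/"),
    ("http://example.org/pets", "http://en.wikipedia.org/wiki/Cat"),
    ("http://example.org/pets", "http://www.catpert.com/"),
]

def build_graph(link_pairs):
    """Return a mapping from each URL to the set of URLs it links to."""
    graph = defaultdict(set)
    for source, target in link_pairs:
        graph[source].add(target)
        graph[target]  # touching the key makes link targets appear as circles too
    return dict(graph)

def reachable(graph, start):
    """Return every URL that can be reached from start by following links."""
    seen = set()
    stack = [start]
    while stack:
        url = stack.pop()
        for nxt in graph.get(url, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

web = build_graph(links)
```

Calling `reachable(web, "http://example.org/pets")` follows the arrows outward from one circle, which is essentially what a web crawler does on a much larger scale.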

Now imagine that the circles can stand for anything, not just web documents. Imagine that the arrows can stand for any relationship, not just navigation links.

(rain gauge #2,388) --detected rain depth--> (3 cm)
(rain gauge #2,388) --time since last emptied--> (60 min)
(rain gauge #2,388) --location--> (Millennium Stadium)
(Cardiff) --contains--> (Millennium Stadium)

A web app that has access to this information can now answer the question, "How much has it rained in Cardiff in the last hour?" with "An average of 3 cm, as reported by 1 rain gauge." Or with more gauges it might be, "An average of 2.95 cm, as reported by 15 rain gauges." These (something) --related somehow--> (something) snippets of information, called triples, can combine into complex graphs of data. And, like web pages, this can happen across servers. The rain depth information could be on one server that only knows the gauge is in Millennium Stadium, while another server knows that Millennium Stadium is in Cardiff. In fact, it makes sense to reference a separate server with lots of geographical knowledge rather than trying to maintain geographical info on a specialized rain gauge server. If the geography server is updated, the rain server automatically and instantly benefits! This is an example of the synergy that can happen with linked data.
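Here is a minimal Python sketch of that rain gauge scenario, with the triples stored as plain tuples. The two "servers" are just lists, the predicate names with units in them are my own tweak to keep the values numeric, and the sketch ignores the "time since last emptied" value rather than checking that it covers exactly the last hour.

```python
# Triples held by a hypothetical rain gauge server.
rain_server = [
    ("rain gauge #2,388", "detected rain depth (cm)", 3.0),
    ("rain gauge #2,388", "time since last emptied (min)", 60),
    ("rain gauge #2,388", "location", "Millennium Stadium"),
]

# Triples held by a separate, hypothetical geography server.
geo_server = [
    ("Cardiff", "contains", "Millennium Stadium"),
]

def average_rain_depth(triples, city):
    """Average the depth reported by every gauge located inside city.

    Returns (average, number of gauges), or (None, 0) if no gauge matches.
    """
    places = {o for s, p, o in triples if s == city and p == "contains"}
    gauges = {s for s, p, o in triples if p == "location" and o in places}
    depths = [o for s, p, o in triples
              if s in gauges and p == "detected rain depth (cm)"]
    if not depths:
        return None, 0
    return sum(depths) / len(depths), len(depths)

# Combining the two sources is just a matter of pooling their triples.
combined = rain_server + geo_server
average, gauge_count = average_rain_depth(combined, "Cardiff")
print(f"An average of {average:g} cm, as reported by {gauge_count} rain gauge(s).")
```

Notice that the rain server's triples alone can't answer the Cardiff question, because nothing in them mentions Cardiff. The answer only appears once the geography triple is added to the pool.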

Wait, Where Are These Factoids?

Regular web links are in web pages and point to other web pages; we're used to that by now. But where are these triples located? They can be embedded into web page code in the form of RDFa. Graph databases called triplestores can also be put on the Internet and directly queried, much as a SQL database could be if it weren't hidden behind an intermediary website. In either case, typical Internet users won't "see" the Semantic Web directly as they see the World Wide Web's documents and links. The Semantic Web exists as a programming-oriented sibling or add-on to the World Wide Web, not as a replacement. Applications use the Semantic Web to enhance traditional web services.
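Real triplestores are usually queried with SPARQL, the W3C query language for this kind of data. The core idea of such a query is to match a pattern with some blanks in it against the stored triples, and that much can be sketched in plain Python. The `match` function below is a toy of my own, not any real triplestore's API, and the data is borrowed from the rain gauge example.

```python
# A tiny in-memory stand-in for a triplestore, using triples from the
# rain gauge example earlier in this article.
store = [
    ("rain gauge #2,388", "detected rain depth", "3 cm"),
    ("rain gauge #2,388", "location", "Millennium Stadium"),
    ("Cardiff", "contains", "Millennium Stadium"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# "What do we know about rain gauge #2,388?"
about_gauge = match(store, subject="rain gauge #2,388")

# "What contains Millennium Stadium?"
containers = [s for s, p, o in match(store, predicate="contains",
                                     obj="Millennium Stadium")]
```

An application would run queries like these behind the scenes and show the user an ordinary web page built from the results.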

What Makes the Semantic Web "Semantic"?

In philosophy, linguistics, and computer science, semantics has to do with meaning in contrast to syntax (which has to do with structure or format). Remember Mad Libs?

The [adjective] outlaw [transitive past tense verb] a [common noun].

So long as these blanks are filled in with the specified parts of speech, the resulting sentence will be syntactically correct; it will have the right format for an English sentence. For example:

The lonely outlaw whistled a tune.
The law-abiding outlaw drank a mortgage.

The second sentence may have proper syntax, but it's nonsense. Because of their meaning, certain words and phrases don't go well together, at least not in a literal sense. Something else to consider:

This isn't a dog, it's a doberman pinscher.

Again, nothing wrong with the syntax, but a doberman pinscher is a type of dog. Another case:

There were witch trials in Salem.

The truth of this sentence depends (in part) on which Salem is meant. It's a true claim when referring to Salem, Massachusetts. It's false for Salem, Iowa...and many other Salems. In standalone databases, ambiguities and mismatched concepts like these aren't much of a problem. A database created for a certain purpose in a certain context has implicit restrictions on the meaning of its data. A Massachusetts newspaper database and an Iowa newspaper database are going to mean something different by just plain "Salem." What happens if we try to publish all of these databases on the web and expect the data to mesh well together? Chaos, unintentional humor, and a general lack of usefulness!

For this reason, the Semantic Web has to be about more than just publishing everyone's data as (subject) --predicate--> (object) triples. Here's a flawed set of triples:

(witch trials) --took place in--> (Salem)
(Tom) --born in--> (Salem)

Was Tom born in the same city that the witch trials took place in? We can't tell because we don't know if the two "Salem"s are the same, or which "Tom" is meant. To solve this problem, URIs (Uniform Resource Identifiers) are used, roughly like this:

(http://dbpedia.org/resource/Category:Salem_witch_trials)
--(http://sw.opencyc.org/2008/06/10/concept/en/eventOccursAt)-->
(http://dbpedia.org/resource/Salem,_Massachusetts)

(http://dbpedia.org/page/Thomas_Poulter)
--(http://dbpedia.org/ontology/birthPlace)-->
(http://dbpedia.org/resource/Salem,_Iowa)

In this case, the "Tom" in question was born in a different Salem. If the URIs had matched up, it would have been possible to draw a new conclusion along the lines of (Tom) --born where occurred--> (Salem witch trials). Why call these URIs rather than URLs? Because they don't necessarily correspond to a visitable web page, although it's considered best practice to make such a page available when possible. A URI can identify a resource (or a concept!) without necessarily providing a location.
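The difference between the flawed triples and the URI versions is easy to show in code. In this rough sketch the URIs are the ones used above, while the function name and the overall approach are just my illustration of the idea.

```python
OCCURS_AT = "http://sw.opencyc.org/2008/06/10/concept/en/eventOccursAt"
BIRTH_PLACE = "http://dbpedia.org/ontology/birthPlace"

# The flawed version: plain words, so both "Salem"s look identical.
word_triples = [
    ("witch trials", OCCURS_AT, "Salem"),
    ("Tom", BIRTH_PLACE, "Salem"),
]

# The URI version, using the identifiers shown above.
uri_triples = [
    ("http://dbpedia.org/resource/Category:Salem_witch_trials", OCCURS_AT,
     "http://dbpedia.org/resource/Salem,_Massachusetts"),
    ("http://dbpedia.org/page/Thomas_Poulter", BIRTH_PLACE,
     "http://dbpedia.org/resource/Salem,_Iowa"),
]

def born_where_occurred(triples, person, event):
    """True if some birthplace of person is also a place where event occurred."""
    birthplaces = {o for s, p, o in triples
                   if s == person and p == BIRTH_PLACE}
    event_places = {o for s, p, o in triples
                    if s == event and p == OCCURS_AT}
    return bool(birthplaces & event_places)

# Plain words produce a match that may well be wrong.
naive = born_where_occurred(word_triples, "Tom", "witch trials")

# URIs keep the two Salems apart.
careful = born_where_occurred(
    uri_triples,
    "http://dbpedia.org/page/Thomas_Poulter",
    "http://dbpedia.org/resource/Category:Salem_witch_trials",
)
```

Here `naive` comes out true and `careful` comes out false. The comparison logic is identical in both cases; only the identifiers changed.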

Did you notice that the URIs above come from both dbpedia.org and opencyc.org? There isn't a single, authorized web domain for the URIs used in linked data. Different organizations can contribute to the pool of URIs. What if two organizations use different URIs for the same thing? There's a triple for that!

(http://dbpedia.org/resource/Salem,_Massachusetts)
--(http://www.w3.org/2002/07/owl#sameAs)-->
(http://sw.cyc.com/concept/Mx4rvViiFpwpEbGdrcN5Y29ycA)
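A program can use sameAs triples to treat several URIs as one thing. Below is a small sketch of one way to do that, grouping equivalent URIs with a union-find structure. The function names are hypothetical, and a real reasoner does a great deal more than this.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def canonical_names(triples):
    """Map every URI that appears in a sameAs triple to one representative."""
    parent = {}

    def find(uri):
        # Follow parent pointers up to the representative of uri's group.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # shorten the chain as we go
            uri = parent[uri]
        return uri

    for s, p, o in triples:
        if p == SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                parent[root_o] = root_s  # merge the two groups
    return {uri: find(uri) for uri in list(parent)}

triples = [
    ("http://dbpedia.org/resource/Salem,_Massachusetts", SAME_AS,
     "http://sw.cyc.com/concept/Mx4rvViiFpwpEbGdrcN5Y29ycA"),
]
names = canonical_names(triples)

def same_thing(a, b):
    """True if a and b are the same URI or are linked by sameAs triples."""
    return names.get(a, a) == names.get(b, b)
```

Grouping rather than comparing pairs directly means that chains work too: if A is the same as B and B is the same as C, then A and C end up with the same representative.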

What about mismatches between URIs for "doberman pinscher" and "dog"? As you might guess by now, a predicate (i.e. middle URI) can be used to say that a doberman is a type of dog. Then, hopefully, any computer program trying to decide if a given specimen is a dog won't stop at finding out that it's a "doberman pinscher"; it will check to see if doberman pinschers are dogs.
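That "is a type of" check might look something like the following sketch. The predicate is the real subClassOf URI from the RDF Schema vocabulary, but the animal URIs live on the reserved example.org domain and are invented for this example, as is the `is_kind_of` function.

```python
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/animals/"  # hypothetical namespace

triples = [
    (EX + "DobermanPinscher", SUBCLASS_OF, EX + "Dog"),
    (EX + "Dog", SUBCLASS_OF, EX + "Mammal"),
]

def is_kind_of(triples, thing, category):
    """True if thing is category or reaches it through subclass triples."""
    seen = set()
    frontier = [thing]
    while frontier:
        current = frontier.pop()
        if current == category:
            return True
        if current in seen:
            continue  # already explored; also guards against cycles
        seen.add(current)
        # Climb one level: everything current is declared a subclass of.
        frontier.extend(o for s, p, o in triples
                        if s == current and p == SUBCLASS_OF)
    return False

doberman_is_dog = is_kind_of(triples, EX + "DobermanPinscher", EX + "Dog")
doberman_is_mammal = is_kind_of(triples, EX + "DobermanPinscher", EX + "Mammal")
dog_is_doberman = is_kind_of(triples, EX + "Dog", EX + "DobermanPinscher")
```

The first two checks succeed, the second by climbing two levels, and the third fails because the relationship only runs one way: every doberman is a dog, but not every dog is a doberman.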

To answer the original question, what makes the Semantic Web "semantic"? All of this background work done by ontologists to separate and combine concepts and to specify the relationships among them. The Semantic Web isn't just about breaking data out of individual databases, but about publishing data in terms of these shared vocabularies and relationship schemes. For data to be useful (and reusable) in a giant, global database, the information that was implicit in the context and structure of local databases has to become explicit. The triple format does this for structure. Ontology work does this for meaning.

When Will "Semantic Web" Be a Household Name?

It probably won't ever be a term everyone knows. The semantic revolution is happening behind the scenes among scientific, business, and cultural heritage groups. If things go well, the Semantic Web will increasingly influence the average person's experience with traditional web sites and services. Even if today's technical implementation of the Semantic Web remains niche, I have no doubt that some of its motivating ideas will reappear in future technologies.

Related Reading