Category Archives: Microformats

Priming the Microformat Aggregation Pump

The wonderful, liberating thing about microformats is that anyone, anywhere can author an instance. Whether you create by hand or use one of the ever-expanding set of tools, the fact that microformats are XHTML means that you can slip them — typically unnoticed by their host — into any system that accepts HTML. Planning an event? Simply drop an hCalendar instance into your blog, and voila!

Voila? Normally, that expression is accompanied by a rabbit being pulled out of a hat. It suggests that something magic and satisfying is going to happen. Having gone to the slight extra trouble of describing my event using hCalendar, what further benefit do I derive?

Well, there is at least one “voila!” effect that you can enjoy from your hCalendar instance: someone reading your blog post where you have encoded an event using hCalendar can use Brian Suda’s cool X2V to transform that hCalendar item into an iCalendar item, and thus get it into their desktop calendar. Eric Meyer shows how.

That’s a cool private benefit, but I want more. And I believe there need to be more demonstrable, immediate, and compelling benefits before the virtuous cycle driving adoption of microformats becomes self-sustaining. We have yet to see something compelling for microformats like hCalendar, hCard, and hReview.

On the other hand, we have seen a compelling application for another microformat: relTags. By adding a little bit of markup to my blog post, my post appears on Technorati’s search results pages and tag pages in such a way that it’s much more likely to be seen by my targeted audience. I get more people reading my post, I get more comments, and that, dear reader, is worthy of a “voila!”
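To make that “little bit of markup” concrete, here is a minimal sketch of how an aggregator might pull tags out of a post. It assumes the tag name is the last path segment of any link carrying rel="tag"; the class name RelTagCollector, the function extract_tags, and the sample post are all made up for illustration, and this is not a description of how Technorati actually does it.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class RelTagCollector(HTMLParser):
    """Collects tag names from <a rel="tag" href="..."> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel is a space-separated list, so "tag" may sit beside other values.
        rel_values = (attributes.get("rel") or "").split()
        href = attributes.get("href")
        if "tag" in rel_values and href:
            # Assumption: the tag is the last path segment of the link target.
            path = urlparse(href).path.rstrip("/")
            self.tags.append(unquote(path.rsplit("/", 1)[-1]))


def extract_tags(html):
    """Return the tag names found in an HTML fragment, in document order."""
    collector = RelTagCollector()
    collector.feed(html)
    return collector.tags


# Hypothetical post body: one tagged link and one ordinary link.
post = (
    '<p>Filed under <a href="http://technorati.com/tag/microformats" rel="tag">'
    'microformats</a> and <a href="http://example.com/about">about</a>.</p>'
)
print(extract_tags(post))
```

Only the first link is reported, because the second one carries no rel="tag".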

The key difference between these two examples is aggregation. Microformats allow the stealthy, distributed deployment of semantically rich data, but there’s not much value to using them until someone is aggregating them, collecting them from the far reaches of the web into a large pile that can be searched, categorized, etc. As a consumer, I don’t want to have to go looking in a thousand blogs to find an hCalendar event that I can add to my desktop calendar.

But therein lies the rub. If you want to develop an application that aggregates a certain type of microformat from blog postings, you’re going to have to scan every blog posting. The vast majority of posts won’t contain what you’re looking for. And that would be fine if there weren’t so many posts to scan, with their number increasing exponentially. What kind of hardware and bandwidth do you need today to scan all new blog posts? I don’t know the answer to that question, but I think it’s safe to assume that if it’s reasonable now, it won’t be in 6 months. And the real kicker is that if you were to go looking for hCalendar events, today you’d likely find only dozens — if that many. What’s your cost per item aggregated? Much too high, I’d wager.

Andy Baio, the creator of, has already set up his service to produce hCalendar marked-up events based on information that users have manually entered into his database. Even though there’s no compelling reason for him to have done so (other than the private benefit described above), it was easy to do, so he did it. On the other hand, aggregating hCalendar events, which would be much more valuable, is something he’s waiting to do until hCalendar becomes more widely used. Why? Because it’s too hard and too expensive. But imagine if it were otherwise, and Andy added aggregation as an additional means for getting content into I drop an hCalendar event into my blog and it shows up, minutes later, as an event in Now that would be worthy of a “voila!” and exactly the kind of benefit that’s needed to drive adoption.

So the question I pose is this: in the face of a rising number of blog posts, how do we reduce the cost of aggregation of microformats to enable more services to aggregate? One thought is this: deploy a service tied into ping-o-matic that scans all new blog postings looking for microformats of a variety of types. When it finds a post containing one of those microformats, it turns around and pings a list of clients who are interested in that microformat type. So, for example, I drop an hCalendar event in my blog, my blog pings ping-o-matic, which in turn pings the “Microformat Router,” which scans the content, sees that it contains an hCalendar, and then pings, which retrieves the content and creates a new entry in’s database of event listings. Voila!
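A rough sketch of the router idea, under some loud assumptions: MicroformatRouter, detect_formats, and the callback signature are names I made up, detection is nothing more than spotting each format's root class name (vevent, vcard, hreview), and a real service would send an HTTP ping to each client rather than call a Python function.

```python
import re

# Root class names that mark the start of each microformat.
ROOT_CLASSES = {"hCalendar": "vevent", "hCard": "vcard", "hReview": "hreview"}

# Crude class-attribute matcher; a production scanner would parse the markup.
CLASS_ATTR = re.compile(r'class\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def detect_formats(html):
    """Return the set of microformat types whose root class appears in html."""
    seen = set()
    for match in CLASS_ATTR.finditer(html):
        seen.update(match.group(1).split())
    return {name for name, root in ROOT_CLASSES.items() if root in seen}


class MicroformatRouter:
    """Scans each pinged post once and fans out to interested clients."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, format_name, callback):
        """Register a client callback for one microformat type."""
        self._subscribers.setdefault(format_name, []).append(callback)

    def handle_ping(self, post_url, html):
        """Scan one post, notify every subscriber of each format found."""
        found = detect_formats(html)
        for format_name in sorted(found):
            for callback in self._subscribers.get(format_name, []):
                callback(post_url, format_name)
        return found


# Hypothetical client: an event service that only cares about hCalendar.
received = []
router = MicroformatRouter()
router.subscribe("hCalendar", lambda url, fmt: received.append((url, fmt)))
router.handle_ping(
    "http://example.com/post/1",
    '<div class="vevent"><span class="summary">Launch party</span></div>',
)
print(received)
```

The scan happens once per post no matter how many clients subscribe, so each additional client costs only one more callback.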

Now, you’re probably wondering how this reduces the cost of aggregating content. True, someone still has to scan through all the content looking for microformats. But that only has to happen once; each new client application adds only marginal cost. Further, this is something that could easily be deployed by a company that’s already scanning through all new blog postings. Don’t make me name names.

Alternatively, this could be set up as an independent service, much like ping-o-matic, serving the common good. Licensing terms could be established for client companies that successfully make use of the aggregation service to subsidize its cost of operation. Either way, the resulting efficiency gains should eventually be recoupable, and the availability of the service would really allow microformats to deliver on their ultimate promise of making content more useful and discoverable.

And that, too, would be worthy of a “Voila!” Any takers?


Writing Microformat Parsers

The embedded microformat example from my previous post got me thinking about different approaches to writing a parser to consume microformats.

A month ago, there were some comments about different parsing approaches here. In thinking about the embedded example above, it’s clear that different approaches can lead to different results (beyond, of course, differences in performance).

You might, for instance, take the approach of using a regular expression to identify a node that indicates the start of a target microformat as Tantek suggested. Once you’ve identified an XML node that contains a microformat, you still have a couple of options.
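As a rough illustration of that first step, here is one way the regular expression might look for hCalendar. The pattern and the helper name find_vevent_starts are mine, not Tantek's, and the pattern assumes tidy markup: quoted attribute values and no stray ">" inside attributes.

```python
import re

# Matches an opening tag whose class attribute contains the whole token
# "vevent" (so "prevevent" or "vevents" do not count).
VEVENT_START = re.compile(
    r'<(?P<tag>[a-zA-Z][\w:-]*)\b[^>]*\bclass\s*=\s*'
    r'(?P<quote>["\'])(?:[^"\']*\s)?vevent(?:\s[^"\']*)?(?P=quote)[^>]*>'
)


def find_vevent_starts(html):
    """Return the character offsets where an hCalendar event appears to begin."""
    return [match.start() for match in VEVENT_START.finditer(html)]


# Hypothetical page fragment: one real event and one near-miss class name.
page = (
    '<p>Intro</p>'
    '<div class="vevent"><span class="summary">Demo</span></div>'
    '<div class="prevevent"></div>'
)
print(find_vevent_starts(page))
```

The offsets only say where to start looking; extracting the properties still needs one of the two approaches described next.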

In one approach, you simply query against the subtree rooted by the identified start node looking for values that are interesting. This is simple code to write but potentially gets you the wrong result — in the embedded example above, for instance, your query might return the wrong “url.”
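Here is a small sketch of that failure mode. The markup is my own guess at what an embedded example could look like (the example.com-style hosts are placeholders), and the helper names are hypothetical. It relies on the fragment being well-formed XHTML so the standard XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: an hCard for the venue nested inside an hCalendar event.
EVENT = """
<div class="vevent">
  <a class="url summary" href="http://conference.example/">Web 2.0 Conference</a>
  <span class="location vcard">
    <a class="url fn org" href="http://hotel.example/">Argent Hotel</a>
  </span>
</div>
"""


def has_class(element, name):
    """True if name is one of the element's space-separated class tokens."""
    return name in element.get("class", "").split()


def naive_urls(root):
    """Every element with class "url" in the subtree, however deeply nested."""
    return [el.get("href") for el in root.iter() if has_class(el, "url")]


event = ET.fromstring(EVENT)
print(naive_urls(event))
```

The query returns both the conference link and the hotel link with nothing to say which one belongs to the event. Had the hCard come first in the markup, taking the first match would have picked the hotel.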

The other approach, which is a bit more complex because it requires implementing a state machine, actually traverses the subtree looking for interesting bits. If it gets to something it doesn’t understand (or, more likely, something it isn’t interested in), it keeps going. If it does find something that it’s interested in, it dives in to extract the relevant value.

This is where there’s a subtle interaction with the embedding example cited above. If the hCard embedded in the hCalendar microformat is bound to a known property of hCalendar (in the example, hCard is bound to hCalendar:location), then my parser will probably not get confused about the hCard:url property because it has enough state to know that it’s processing a known hCalendar property. Thus my hCalendar parser doesn’t really have to know much of anything about hCard, which is a bit of a relief.

If, however, the hCard is not bound to any of hCalendar’s properties — it’s merely inside it but not explicitly embedded in some known property of hCalendar — then I’ve got a potential problem. Either I have to know about hCard’s definition or I’m going to misinterpret hCard’s url as an hCalendar url.
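One possible middle path, sketched below, is a traversal that knows only the root class names of other microformats (not their property definitions) and refuses to descend into them. That is an assumption on my part about how much knowledge is acceptable. The names FOREIGN_ROOTS and event_properties and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the hCalendar parser knows only these root class names of
# other microformats, nothing about the properties inside them.
FOREIGN_ROOTS = {"vcard", "hreview"}

# Hypothetical markup: an hCard inside the event but bound to no property.
EVENT_UNBOUND = """
<div class="vevent">
  <span class="vcard">
    <a class="url fn org" href="http://hotel.example/">Argent Hotel</a>
  </span>
  <a class="url summary" href="http://conference.example/">Web 2.0 Conference</a>
</div>
"""


def classes(element):
    """The element's class attribute as a set of tokens."""
    return set(element.get("class", "").split())


def event_properties(element, wanted=("url", "summary", "location"), found=None):
    """Walk the subtree, keeping the first value seen for each wanted property."""
    if found is None:
        found = {}
    for child in element:
        names = classes(child)
        for name in names & set(wanted):
            if name == "url":
                value = child.get("href")
            else:
                value = "".join(child.itertext()).strip()
            found.setdefault(name, value)
        if names & FOREIGN_ROOTS:
            continue  # nested microformat: its properties are not ours
        event_properties(child, wanted, found)
    return found


print(event_properties(ET.fromstring(EVENT_UNBOUND)))
```

For the unbound hCard above, the result holds only the conference url and summary; the hotel's url is never visited. When the hCard is bound (class="location vcard"), the same walk records the location text and still skips the hCard's own properties.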

But I wonder: why would someone embed an hCard inside an hCalendar without binding it to (i.e., embedding it inside) one of hCalendar’s properties? What would such embedding mean? If there’s no real reason to do this (because it doesn’t really mean anything) then the problem fairly evaporates, I think.

Is this right?


Head-Spinnin’ on MicroFormats

[Update: 6/6/2005] Corrected some syntactic mistakes in the embedded microformat example. Apologies to any who got caught in the crossfire.

Okay, discussion on my most recent post about MicroFormats has left me dazed and confused. Do I need to adjust my medication? Probably. But I list below the following points of bewilderment. Somebody, please help!

  • Tantek seems to be saying that this discussion is the misguided theoretical contemplation of microformat-bashing naysayers. I find this perplexing as everybody (your author humbly excluded pending the final adjustment of his meds; see above) involved in the discussion seems to be an intelligent, fairly ardent supporter of microformats trying to understand how to build systems around them today. As near as I can tell, we’re all in this discussion to see how to clear the path for more rapid adoption of microformats. Part of that involves looking at areas where there might be stumbling blocks, not to highlight them as reasons not to proceed but to understand them as areas that require caution, and perhaps invent solutions. None of this strikes me as antithetical to progress, but perhaps I’m missing something (is it 1 red pill and 2 blue ones, or 2 red ones and 1 blue one? Aaargh.).
  • Brian further explicates some of the issues regarding “url” appearing as a property in both hCalendar and hCard. But I am again confused when he writes “I agree that the URL in vCard IS the same URL in iCal.” I just realized that part of my problem is that I’m not sure I understand the theoretical example we’ve been discussing, where an hCard is embedded inside an hCalendar instance. Are we assuming that the hCard encapsulates information about the location of the event (the famous Argent Hotel in San Francisco in the canonical example)? Or is it the subject of the event (the Web 2.0 Conference)? I’ve been assuming that it’s the former, but I suppose it could be either. At any rate, there are (at least) two potential urls (one for the hotel and one for the conference), so it’s not clear to me what Brian means when he says they’re the same. I’m biting the bullet and writing an example to illustrate.
      Web 2.0 Conference: October 5;- at the Argent Hotel, San Francisco, CA

    I have a sneaking suspicion that the example above is not structured the way that others have been thinking about it. In constructing it, I started to feel that I was on shaky ground in using an hCard to represent the hotel. Is that inappropriate? The Technorati Wiki refers to hCard as a representation for people and companies; the vCard spec says it’s for representing a “white-pages person object.” I’m assuming that, despite vCard’s apparent limitation of scope, people do (or will) use vCards for contact information for companies as well as people.

    Anyway, I’m hoping that in light of the example above, Brian will help me to understand what he means when he says that url in hCalendar and url in hCard are the same.

Finally, I wanted to thank Ryan for pointing me to Douglas Clifton’s DRX. Although I haven’t been able to see it yet (site appears to be down), it’s always helpful to see how others are using/interpreting these ideas. And props to Brian for pointing me at the various brainstorming pages; somehow I had missed those.


MicroFormats Continued

In response to my previous inflammatory post, I got a pair of good comments from Ryan King and Brian Suda. To recap, we were focusing on two areas where microformats might run into difficulty: inability to perform validation against a machine-readable profile, and namespace collisions.

Ryan suggests that microformat authoring applets are a good way to mitigate problems that might otherwise crop up due to lack of validation. If microformats aren’t being coded by hand, they’re more likely to be valid. Anybody following this debate has probably already seen these.
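For a sense of what even a small machine-readable profile could buy, here is a sketch of a check that an authoring tool or an aggregator might run. The PROFILE table is an assumption made up for illustration (I treat summary and dtstart as required for a vevent), and missing_properties is a hypothetical helper, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Assumed profile, for illustration only: root class -> required properties.
PROFILE = {"vevent": {"summary", "dtstart"}}


def missing_properties(root, profile=PROFILE):
    """Return the required class names that never appear under root."""
    root_classes = set(root.get("class", "").split())
    present = set()
    for element in root.iter():
        present.update(element.get("class", "").split())
    missing = set()
    for root_class, required in profile.items():
        if root_class in root_classes:
            missing |= required - present
    return sorted(missing)


# Hypothetical hand-coded event whose author forgot the start date.
hand_coded = ET.fromstring(
    '<div class="vevent"><span class="summary">Web 2.0 Conference</span></div>'
)
print(missing_properties(hand_coded))
```

This checks presence only. It says nothing about whether the values are well formed, and it ignores nesting, so a property supplied by an embedded microformat would wrongly count as present.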

Brian points out that, even within the current handful of microformats, collisions in the property namespace are already a real problem. He shows that because hCard and hCalendar both use “url” as a property name (albeit in similar ways), someone parsing an hCalendar that happens to contain an embedded hCard is liable to misinterpret hCard’s “url” as belonging to hCalendar. Is there a way around this? Can I write a parser for a particular kind of microformat such that it can handle other embedded microformats that I’ve never seen before without choking if there’s a namespace collision?


MicroFormats: What’s their problem?

[Update 6/2/2005: Added tags]

In response to my previous post, Brian Suda provided some valuable commentary on the limitations of MicroFormats, especially as compared with RDF. Some of the discussion between us happened offline via email, but with Brian’s permission I am paraphrasing and summarizing here to get the discussion back into the public domain.

The purpose of this analysis was to gain an understanding of contexts in which using a MicroFormat might be successful as an easy-to-author, good-enough representation of structured data, and, in the same vein, understand situations in which using a MicroFormat would be an invitation to semantic disaster. The description of MicroFormats provides limited guidance here. Thus we begin probing the soft underbelly of MicroFormats: