A List Apart
Discovering Magic Issue № 293

Published in JavaScript

Most of us create identities across the web without much conscious thought. We fill in profiles, upload photos, videos, reviews, and bookmarks. Although this information is often public, it’s fragmented into the silos of individual websites. Wouldn’t it be a little magical if, when you signed up for a new site, the site said something like, “We notice you have a profile photo on Flickr and Twitter, would you like to use one of those or upload a new one?” I built a JavaScript library that can help you do just that. Ident Engine discovers and retrieves distributed identities and user-generated content to help you build a little magic into your user interfaces.

Try it out! Enter your profile URLs into the lifestream and combined profile demos. Were you shocked by the level of detail it found out about you? Let me show you how it works.

Our footprints across the web

Social media sites encourage us to have more open and transparent conversations, and create opportunities for new people to participate in our lives. We often see these distributed identities in the “elsewhere on the web” design pattern, in which people list their social media site profiles like this:

    
Elsewhere on the web:
    
    
Places you can find me:
    

Most of these identities are tied into social media sites where we create content. Most web identities are distributed—built from a mesh of interlinked profiles, each of which contains a wealth of information. These identities are our digital footprints across the web. APIs that use the semantic web and open data formats are at the heart of the architecture that brings all our footprints together.

The semantic web and open format data

The semantic web attempts to make information that is currently only intelligible to humans machine readable. Microformats are the semantic web technology that currently shows the most practical promise. Microformats are small patterns of class names and attributes you can add to HTML to help define blocks of reusable semantic data. There is a lot more information on the web formatted with microformats than most people realize: Yahoo’s SearchMonkey recently found 1.35 billion profiles marked up with the microformat hCard. Many other microformats are used to mark up other kinds of user-generated content, from lengthy reviews to tiny Twitter updates.

Beyond microformats, other open data formats such as RDF, RSS, and Atom contain rich data we can use.

Most of our web footprints are encapsulated in open-standard data formats. To exploit the value of these fragments, all we need is a machine-readable method to link the fragments to their common owner. This is a job for XFN (XHTML Friends Network).

The power of rel=“me” and Google’s Social Graph API

XFN is a simple and powerful microformat. We can use XFN to define an interlinking set of web pages that, together, represent an individual. To achieve this, simply add the rel=“me” attribute to any link between two web pages that represent the same person. For example, I could use the markup below to define a relationship between any page on the web that represents me and my Twitter profile:
 

<a rel="me" href="http://twitter.com/glennjones">Twitter</a>

Just over a year ago, Google released the Social Graph API, which allows anyone to query these relationships. After you provide a URL starting point, it returns a map (social graph) of all the related pages (edges) connected by rel=“me” links. Using this API, you can discover the numerous identities people have across the web. Try out the identity discovery demo. You’ll find that results vary depending on how many social media sites an individual uses and how well those sites are interlinked. Sometimes, changing the starting point returns different results.

The two ways your identities are found

There are two different ways to retrieve information from the Social Graph API using its rel=“me” relationships. The simplest way is otherme, which returns a list of sites based on rel=“me”. Try this example: http://socialgraph.apis.google.com/otherme?q=http://www.glennjones.net/&pretty=1

The second, lower-level Social Graph API call, lookup, allows you to control the inclusion of inward and outward linking. It also returns more complex link relationship views. The two main parameters of this call are edo (edges outward) and edi (edges inward).
 
Here’s a look at the output of a lookup call for glennjones.net: http://socialgraph.apis.google.com/lookup?q=http://www.glennjones.net/&fme=1&edi=0&edo=1&pretty=1&jme=1

If we executed an API call against the social graph in the diagram below to find only the outward relationships (i.e., edi=0&edo=1), it would return my blog, my Last.fm profile, my Google profile, and my Flickr pages. If we instead executed an API call to find both the outward and inward relationships (i.e., edi=1&edo=1), the Social Graph API would also find my Brightkite page.
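To make the two parameters concrete, here is a minimal sketch of a helper that assembles lookup URLs. The function name buildLookupUrl and its options object are illustrative and not part of the Social Graph API or Ident Engine; the endpoint and the q, fme, edi, edo, and pretty parameters are copied from the example URL above. Unlike that example, the sketch percent-encodes the q value.

```javascript
// Endpoint copied from the example URLs in this article.
var SOCIAL_GRAPH_BASE = 'http://socialgraph.apis.google.com';

// Illustrative helper (not part of any library): builds a lookup URL.
// options.inward  -> edi (edges inward), off unless set to true
// options.outward -> edo (edges outward), on unless set to false
function buildLookupUrl(startUrl, options) {
  var opts = options || {};
  var inward = opts.inward === true;
  var outward = opts.outward !== false;
  var params = [
    'q=' + encodeURIComponent(startUrl),
    'fme=1', // copied as-is from the article's example URL
    'edi=' + (inward ? 1 : 0),
    'edo=' + (outward ? 1 : 0),
    'pretty=1'
  ];
  return SOCIAL_GRAPH_BASE + '/lookup?' + params.join('&');
}

// Outward links only.
var outwardOnly = buildLookupUrl('http://www.glennjones.net/');
// Outward and inward links together.
var both = buildLookupUrl('http://www.glennjones.net/', { inward: true });
```

Keeping the two flags as named options makes it harder to mix up edi and edo when switching between the safe and the fuller result sets.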
 
Google’s testparse is a useful debugging method that displays the relationship links Google finds in a given piece of HTML.

Fig. 1. Node linking diagram

Of the two methods to query the social graph, the first is to start at point A (the first URL) and follow all the outward links to other pages (edges) using rel=“me”. In the diagram, A links to B and then B links to all the C pages. The links from A to B to C are called a chain.

Only search engines, such as Google, can use the second method, which is to search for any inward rel=“me” links from other webpages (edges). For example, point X (my Brightkite account) has a rel=“me” link to my blog, but there’s no way to find point X by following links from point A. You must search every page to find point X.
 
The otherme API call only makes use of outward relationships. It returns reliable verified linkages, but often, the results are limited.
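The difference between the two methods is easier to see on a toy graph. The sketch below uses a hypothetical in-memory map of pages to their outward rel="me" links; the page names A, B, C1, C2, and X mirror the diagram, and none of this is the Social Graph API's real data format. followOutward walks the chain from a starting page, while findInward has to scan every known page, which is why only a crawler with a full index can do it.

```javascript
// Hypothetical graph: each page maps to the pages it links to with rel="me".
var graph = {
  A: ['B'],
  B: ['C1', 'C2'],
  C1: [],
  C2: [],
  X: ['A'] // X claims A, but no page links to X
};

// Method one: follow outward links from the start page (the chain).
function followOutward(graph, start) {
  var seen = {};
  var queue = [start];
  var found = [];
  seen[start] = true;
  while (queue.length > 0) {
    var page = queue.shift();
    var links = graph[page] || [];
    for (var i = 0; i < links.length; i++) {
      if (!seen[links[i]]) {
        seen[links[i]] = true;
        found.push(links[i]);
        queue.push(links[i]);
      }
    }
  }
  return found;
}

// Method two: scan every known page for links pointing at the target.
function findInward(graph, target) {
  var found = [];
  for (var page in graph) {
    if (graph.hasOwnProperty(page) && graph[page].indexOf(target) !== -1) {
      found.push(page);
    }
  }
  return found;
}

var chain = followOutward(graph, 'A'); // B, C1, C2
var claims = findInward(graph, 'A');   // X
```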

Imposters and rogue relationships

These methods of extrapolating relationship data are always open to errors and possible abuse. If you use only outward linking (i.e., the otherme API call), the results are usually solid. This is because the same individual should own each of the pages in the chain, so the process is hard to hijack. Unfortunately, the otherme method does not return as many results as it would in combination with inward claims.

But you can’t really trust inward claims. Any imposter can include a rel=“me” link on a page to hijack someone else’s social graph. People can copy and paste HTML containing semantic markup into the wrong context by mistake, creating rogue relationship links. With some careful post-processing, however, you can use inward-linking data if you’re willing to accept the odd error and rogue relationship in exchange for a fuller set of results.
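One possible form of post-processing, offered here as an assumption rather than as a description of what Ident Engine does, is to keep an inward claim only when its username matches one already seen on a verified outward account. The function name filterInwardClaims, the account objects, and the example.com and example.org sites are all hypothetical. This kind of check mainly weeds out accidental rogue links; a determined imposter could still register a matching username.

```javascript
// Illustrative post-processing: keep an inward claim only when its
// username matches one already seen on a verified (outward) account.
function filterInwardClaims(verifiedAccounts, inwardClaims) {
  var known = {};
  for (var i = 0; i < verifiedAccounts.length; i++) {
    known[verifiedAccounts[i].username.toLowerCase()] = true;
  }
  var accepted = [];
  var rejected = [];
  for (var j = 0; j < inwardClaims.length; j++) {
    var claim = inwardClaims[j];
    if (known[claim.username.toLowerCase()] === true) {
      accepted.push(claim);
    } else {
      rejected.push(claim);
    }
  }
  return { accepted: accepted, rejected: rejected };
}

// Accounts reached through outward links.
var verified = [
  { domain: 'twitter.com', username: 'glennjones' },
  { domain: 'flickr.com', username: 'glennjonesnet' }
];
// Accounts that only claim a relationship through inward links.
var inward = [
  { domain: 'example.com', username: 'GlennJones' },
  { domain: 'example.org', username: 'someoneelse' }
];
var filtered = filterInwardClaims(verified, inward);
```

Rejected claims do not have to be thrown away; an interface could show them separately and ask the user to confirm them.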

The decision to use inward claims should be based on the type of interface you are building and your audience’s expectations. If you need to play it safe, use only the otherme method call. Over time, people will create stronger linkages between their identities and using inward claims will become unnecessary.

Beyond listing identities: social graph node mapping

The Social Graph API returns interlinked page URLs and can also provide more detailed information from some social media sites. Using a technique called “social graph node mapping,” it finds useful URLs related to the same individual on a site. For certain accounts, it can also bring back some small bits of data such as the user’s full name.
 
To extract this second tier of information, the API must create a canonical user account called SGN (Social Graph Normalized URL). For example, using an SGN, my account on Flickr is expressed as: sgn://flickr.com/?ident=glennjonesnet. This domain/username pairing can be used to find other URLs (service endpoints). Using the node mapping technique, the results for my account on Flickr look like this:
 

"sgn://flickr.com/?ident=glennjonesnet": {
  "attributes": {
    "url": "http://www.flickr.com/photos/glennjonesnet/",
    "profile": "http://www.flickr.com/people/glennjonesnet/",
    "rss": "http://api.flickr.com/services/...",
    "atom": "http://api.flickr.com/services/..."
  }
}
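
As a rough illustration of the domain/username pairing, the sketch below splits an SGN of the form shown above into its two parts. The helper name parseSgn is made up for this example, and it handles only the ident= form that appears in this article.

```javascript
// Illustrative helper: splits an SGN such as
// sgn://flickr.com/?ident=glennjonesnet into domain and username.
// Only the "ident=" form shown in this article is handled.
function parseSgn(sgn) {
  var match = /^sgn:\/\/([^\/?#]+)\/\?ident=([^&#]+)$/.exec(sgn);
  if (match === null) {
    return null; // not an SGN in the expected form
  }
  return { domain: match[1], username: decodeURIComponent(match[2]) };
}

var account = parseSgn('sgn://flickr.com/?ident=glennjonesnet');
// account.domain is 'flickr.com' and account.username is 'glennjonesnet'
```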

Enhancing the discovery process

To enhance SGN, I added my own custom data to the API’s output. This allows me to describe many more sites and service endpoints, and—more importantly—to programmatically retrieve content from these sites.

The custom data for a single Flickr service endpoint description:

uri-template: http://www.flickr.com/people/murtaugh/
media-type: Html
schema: hCard
content-type: Profile

Using this data, we can cross-reference the accounts the Social Graph API finds to identify specific data we want to retrieve. With API endpoint discovery, we can target specific content types from across an individual’s accounts. To enhance the identity data we have so far, we collect all the available profiles from across the web. Try out the profiles demo to learn more.
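A simplified sketch of that cross-referencing step follows. The property names come from the endpoint description above, but the {username} placeholder syntax, the domain property, the Twitter entry, and the function name findEndpoints are assumptions made for illustration; Ident Engine's actual data may differ.

```javascript
// Endpoint descriptions modelled on the custom data shown above.
// The {username} placeholder syntax is an assumption for this sketch.
var endpoints = [
  {
    domain: 'flickr.com',
    'uri-template': 'http://www.flickr.com/people/{username}/',
    'media-type': 'Html',
    schema: 'hCard',
    'content-type': 'Profile'
  },
  {
    domain: 'twitter.com',
    'uri-template': 'http://twitter.com/{username}',
    'media-type': 'Html',
    schema: 'hAtom',
    'content-type': 'Status'
  }
];

// Returns concrete URLs for every discovered account that has an
// endpoint of the requested content type.
function findEndpoints(accounts, endpoints, contentType) {
  var results = [];
  for (var i = 0; i < accounts.length; i++) {
    for (var j = 0; j < endpoints.length; j++) {
      var ep = endpoints[j];
      if (ep.domain === accounts[i].domain && ep['content-type'] === contentType) {
        var name = encodeURIComponent(accounts[i].username);
        results.push({
          url: ep['uri-template'].replace('{username}', name),
          schema: ep.schema
        });
      }
    }
  }
  return results;
}

var accounts = [
  { domain: 'flickr.com', username: 'glennjonesnet' },
  { domain: 'twitter.com', username: 'glennjones' }
];
var profileUrls = findEndpoints(accounts, endpoints, 'Profile');
var statusUrls = findEndpoints(accounts, endpoints, 'Status');
```

Because each result carries its schema, the caller knows which parser to hand the URL to.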

Although the hCard format is comprehensive, it’s often used to hold only two pieces of information, such as name and URL.

It makes sense to aggregate our collection of profiles into a single, combined profile. The aggregation rules Ident Engine uses favor completeness, and thus favor business-related data sources over personal data sources. For example, the address is chosen by the number of data elements it contains. Some values such as name, username, and URL are based on the most commonly used value across all the profiles. Try out the combined profile demo to learn more.
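
The two rules described above can be sketched as follows. This is not Ident Engine's implementation; the profile shape, the placeholder values, and the function names are assumptions, and a tie goes to whichever value reaches the top count first.

```javascript
// Picks the value that occurs most often, skipping empty entries.
function mostCommon(values) {
  var counts = Object.create(null);
  var best;
  var bestCount = 0;
  for (var i = 0; i < values.length; i++) {
    var key = values[i];
    if (key === undefined || key === null || key === '') {
      continue;
    }
    counts[key] = (counts[key] || 0) + 1;
    if (counts[key] > bestCount) {
      best = key;
      bestCount = counts[key];
    }
  }
  return best;
}

// Counts the non-empty fields in an address object.
function countFields(address) {
  var n = 0;
  for (var field in address) {
    if (address.hasOwnProperty(field) && address[field]) {
      n++;
    }
  }
  return n;
}

// Name: most common value. Address: the one with the most data elements.
function combineProfiles(profiles) {
  var names = [];
  var bestAddress = null;
  for (var i = 0; i < profiles.length; i++) {
    names.push(profiles[i].name);
    var adr = profiles[i].adr;
    if (adr && (bestAddress === null || countFields(adr) > countFields(bestAddress))) {
      bestAddress = adr;
    }
  }
  return { name: mostCommon(names), adr: bestAddress };
}

var combined = combineProfiles([
  { name: 'Alex Example', adr: { locality: 'Example Town' } },
  { name: 'alexexample' },
  { name: 'Alex Example', adr: { locality: 'Example Town', country: 'Exampleland' } }
]);
```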

Blending profile and social graph data

You can use the profiles to help extend your social graph. Profiles marked up in hCard can contain multiple URLs. Most social media sites mark these URLs with rel=“me”, but some don’t. Sometimes you can discover new profiles by checking the URLs listed. Ident Engine also aggregates the data to find the top usernames and the primary URL defined across all the profiles. This data helps to resolve ambiguity issues, such as determining which hCard is the representative hCard on any given page.

Parsing profiles—you’ve got choices

There are several different microformat parsers you can use to parse hCard profiles. Currently, Ident Engine uses two: Yahoo’s YQL and my own .Net parser, UfXtract. YQL parses microformats directly from Yahoo’s search index, and provides a fast, reliable URL-based API. Its index does not include all the pages on the web, but when you need responsiveness and scalability, YQL is the best choice. It can return XML or JSON. Here’s an example query using YQL microformats:

select * from microformats 
 where url = 'http://www.glennjones.net/about/'

http://query.yahooapis.com/v1/public/yql?q=select * from microformats where url='http:%2F/www.glennjones.net/about/'&format=xml
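
Because the query travels inside a URL, it is safer to percent-encode it before sending. A minimal sketch, assuming a helper named buildYqlUrl (not part of YQL or Ident Engine) and using only the endpoint, query, and format parameter from the example above:

```javascript
// Endpoint copied from the example above.
var YQL_BASE = 'http://query.yahooapis.com/v1/public/yql';

// Illustrative helper: builds a YQL request for the microformats on a page.
// format is 'xml' (the default here) or 'json'.
function buildYqlUrl(pageUrl, format) {
  // Refuse URLs that would break out of the quoted string in the query.
  if (pageUrl.indexOf("'") !== -1) {
    throw new Error('Page URL must not contain a single quote');
  }
  var query = "select * from microformats where url='" + pageUrl + "'";
  return YQL_BASE + '?q=' + encodeURIComponent(query) +
    '&format=' + (format || 'xml');
}

var yqlUrl = buildYqlUrl('http://www.glennjones.net/about/', 'json');
```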

 
The UfXtract parser is more compliant with the microformat specifications and can parse complex microformat markup. However, the UfXtract API is better suited to personal hacks and projects than to large-scale commercial use. It too can return XML or JSON. Here’s an example UfXtract API call:
 

http://ufxtract.com/api/?url=http://www.glennjones.net/about/&format=hCard&output=xml

 
Other good parsers include Optimus, hKit, and Swignition.
 

Retrieving user-generated content

Profiles are not the only type of content API endpoints can describe. We can also target other open standard data sources. In fact, it’s possible to build a fairly complete lifestreaming application from these sources. Try out the lifestream demo to see this idea in action.

The lifestream demo content is parsed from RSS, Atom, or microformats. There are many server-side libraries for parsing RSS and Atom feeds, but when working client-side, the YQL API can successfully parse most feeds and convert them into a single XML or JSON format. Here’s an example YQL feed query:

(Line wraps marked » —Ed.)

select * from feed 
 where url='http://api.flickr.com/services/feeds/ »
 photos_public.gne?id=77314934@N00'

http://query.yahooapis.com/v1/public/yql?q=select * from feed where url='http://api.flickr.com/services/feeds/photos_public.gne?id=77314934@N00'&format=xml

 

The API can return additional information about embedded rich media resources, such as images or audio. You can encapsulate all types of content in feeds, so Ident Engine’s overlay data has a “content-type” property for each API endpoint. This describes the content as an event, activity stream, status, or video, etc. With this information, you can present content in a reasonable context.
 
Using feeds does not preclude using microformats as a method of collecting content. There are many circumstances in which microformats have advantages over feeds. For example, I can use the hAtom microformat to extract Twitter statuses, but their equivalent RSS feeds are not publicly accessible.

 
  http://ufxtract.com/api/?url=http://twitter.com/glennjones&format=hAtom&output=xml

 

Privacy

It can be very disconcerting for users to confront the data you’ve discovered using these methods. They may even be unaware that they’d shared the information publicly. It’s important to educate web users about how they present themselves, and about which content they choose to make public.

Tools such as the Social Graph API and data formats such as microformats don’t degrade privacy (on the web, there is no real privacy in obscurity), but as designers and developers, we need to find better ways to encourage users to make informed decisions about privacy. Most social media site approaches to privacy (and persona projection) are inadequate and unsophisticated compared to the ways in which we deal with privacy in the real world.

To create trust when designing interfaces that collect personal profile data, we must be as transparent as possible, and provide links to the original sources. If users find data that is wrong or outdated, they should have access to an easy method of tracking down that information so they can correct it.
 

Consider your audience

The methods we’ve examined work best for certain audiences under specific circumstances—for example, users of sites or applications that provide services to other social media sites. In such cases, you’ll often ask for account details as part of the first interaction with new users, and your audience is by definition actively participating in self publishing.

Use progressive enhancement

Be sure to consider progressive enhancement when you design interface functionality around this type of information discovery. Never rely completely on the data quantity or quality returned. Instead, design your discovery features to supplement the main task flow. For example, if you design a practical photo picker, provide a file upload as a starting point, and offer enhanced discovery as an optional extra.
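
As a sketch of that photo picker idea, the function below always offers the file upload first and appends discovered photos only when the discovery step returned something usable. The name buildPhotoChoices, the shape of the photo objects, and the example.com URL are hypothetical.

```javascript
// Illustrative only: builds the list of choices for a profile photo picker.
// The upload option is always present; discovered photos are an optional extra.
function buildPhotoChoices(discoveredPhotos) {
  var choices = [{ type: 'upload', label: 'Upload a new photo' }];
  var photos = discoveredPhotos || [];
  for (var i = 0; i < photos.length; i++) {
    var photo = photos[i];
    // Skip incomplete results rather than trusting the data quality.
    if (photo && photo.url && photo.source) {
      choices.push({
        type: 'discovered',
        label: 'Use photo from ' + photo.source,
        url: photo.url
      });
    }
  }
  return choices;
}

// Discovery returned nothing: the picker still works.
var noData = buildPhotoChoices(null);
// Discovery returned one usable photo and one incomplete result.
var withData = buildPhotoChoices([
  { source: 'Flickr', url: 'http://example.com/photo.jpg' },
  { source: 'Twitter' } // missing URL, so it is ignored
]);
```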

All wrapped up for you: Ident Engine

We’ve covered how to join social graph and profile data, but a few secondary problems remain. Rather than going into greater detail, I created a JavaScript library, and I invite the more technically curious among you to pull it to pieces to find out how all the cogs fit together. For those less technically curious among you, I hope it will be relatively easy to use.
 
The library uses jQuery, so initiating a search is simple. First, you need to bind a function to render your results to the library. In the code below I bound a call to the renderListing function into any update events fired by Ident Engine. The library will fire an update every time it finds new data during the search process.
 

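jQuery(document).ready(function () {
  // Assumes doc refers to the jQuery-wrapped document.
  var doc = jQuery(document);
  // Re-render the listing every time Ident Engine fires an update.
  doc.bind('ident:update', renderListing);
  // Include inward rel="me" claims as well as outward links.
  ident.useInwardEdges = true;
  var url = 'http://twitter.com/glennjones';
  ident.search(url);
});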

The render function first clears any previous content. Then it loops over the array of identity objects, in this case to render any profile URLs found.
 

function renderListing(e){
  resetContent();
  // Build a list inside the #results container.
  var ul = jQuery('<ul></ul>').appendTo('#results');
  for (var x = 0; x < ident.identities.length; x++) {
    var profileUrl = ident.identities[x].profileUrl;
    jQuery('<li>' + profileUrl + '</li>').appendTo(ul);
  }
}
function resetContent(){
    jQuery('#results').html('');
}

Parsing user content with Ident Engine

There are several ways to extract user-generated content from the API endpoints Ident Engine discovers. The following two methods will load my status from Identica using the hAtom microformat. The findContent method can only be used after a search, while the loadContent method works independently of the search.

ident.findContent(
  'identi.ca', 
  'Status', 
  'hAtom' 
);
   
ident.loadContent( 
  'http://identi.ca/glennjones', 
  'identi.ca', 
  'Identica', 
  'Status', 
  'hAtom' 
);

The returned data is stored in seven collections: Identity, Profile, Resume, Entry, Event, XFN, and Tag. As with profiles, an event fires to tell you new content has been added, once it’s collected from the APIs. You can download the source code, along with documentation and examples, from http://identengine.com/. It’s under an MIT open source licence.

Discover the possibilities

The amount of data available on the web is directly tied to social trends in openness and self publishing. With billions of microformats embedded in the web and RDF growing in strength, it’s now possible to build applications on this data. Now it’s your turn to use identity discovery to build a little magic into the user experiences you design.
