Accent Folding for Auto-Complete
Issue № 301

Accent Folding for Auto-Complete

Another generation of technology has passed and Unicode support is almost everywhere. The next step is to write software that is not just “internationalized” but truly multilingual. In this article we will skip through a bit of history and theory, then illustrate a neat hack called accent-folding. Accent-folding has its limitations but it can help make some important yet overlooked user interactions work better.

Article Continues Below

A common assumption about internationalization is that every user fits into a single locale like “English, United States” or “French, France.” It’s a hangover from the PC days when just getting the computer to display the right squiggly bits was a big deal. One byte equaled one character, no exceptions, and you could only load one language’s alphabet at a time. This was fine because it was better than nothing, and because users spent most of their time with documents they or their coworkers produced themselves.

Today users deal with data from everywhere, in multiple languages and locales, all the time. The locale I prefer is only loosely correlated with the locales I expect applications to process.

Consider this address book:

  • Fulanito López
  • Erik Lørgensen
  • Lorena Smith
  • James Lö

If I compose a new message and type “lo” in the To: field, what should happen? In many applications only Lorena will show up. These applications “support Unicode,” in the sense that they don’t corrupt or barf on it, but that’s all.

Screenshot of Microsoft Entourage's address book autosuggest, which does not fold accented characters.

Fig 1. Hey Entourage, where are my contacts?

This problem is not just in address books. Think about inboxes, social bookmarks, comment feeds, users who speak multiple languages, users in internet cafés in foreign countries, even URLs. Look at the journalist Ryszard Kapuściński and how different websites handle his name:

There is no excuse for your software to play dumb when the user types “cafe” instead of “café.”

Áçčềñṭ-Ḟøłɖǐṅg#section1

In specific applications of search that favor recall over precision, such as our address book example, á, a, å, and â can be treated as equivalent. Accents (a.k.a diacritical marks) are pronunciation hints that don’t affect the textual meaning. Entering them can be cumbersome, especially on mobile devices.

Example of a YUI Autosuggest widget with accent-folding added.

Fig 2. accent-folding in an autosuggest widget

An accent-folding function essentially maps Unicode characters to ASCII equivalents. Anywhere you apply case-folding, you should consider accent-folding, and for exactly the same reasons. With accent-folding, it doesn’t matter whether users search for cafe, café or even çåFé; the results will be the same.

Be aware that there are a million caveats to accent rules. You will almost certainly get it wrong for somebody, somewhere. Nearly every alphabet has a few extra-special marks that do affect meaning, and, of course, non-Western alphabets have completely different rules.

A minor gotcha is the Unicode “fullwidth” Roman alphabet. These are fixed-width versions of plain ASCII characters designed to line up with Chinese/Japanese/Korean characters (e.g., “1979年8月15日”). They reside in 0xff00 to 0xff5e and should be treated as equivalent to their ASCII counterparts.

Hey man, I’m only here for the copy/paste#section2

I’ve posted more complete examples on GitHub, but for illustration, here’s a basic accent-folder in Javascript:

(Line wraps marked » —Ed.)

var accentMap = {
  'á':'a', 'é':'e', 'í':'i','ó':'o','ú':'u'
};function accent_fold (s) {
  if (!s) { return ''; }
  var ret = '';
  for (var i = 0; i < s.length; i++) {
    ret += accent_map[s.charAt(i)] || s.charAt(i);
  }
  return ret;
};

Regular Expressions#section3

Regular expressions are very tricky to make accent-aware. Notice that in Fig. 2 only the unaccented entries are in bold type. The problem is that the Unicode character layout does not lend itself to patterns that cut across languages. The proper regex for “lo” would be something insane like:

[LlĹ弾ĻļḶḷḸḹḼḽḺḻŁłŁłĿŀȽƚɫ][OoÓóÒòŎŏÔôỐốỒồỖöȪȫŐőÕõṌȭȮȯǾǿ...ǬǭŌ]

Never, never do this. As of this writing, few regular expression engines support shortcuts for Unicode character classes. PCRE and Java seem to be in the vanguard. You probably shouldn’t push it. Instead, try highlighting an accent-folded version of the string, and then use those character positions to highlight the original, like so:

<b>// accent_folded_hilite("Fulanilo López", 'lo')
//   --> "Fulani<b>lo</b> <b>Ló</b>pez"
//
function accent_folded_hilite(str, q) {
  var str_folded = accent_fold(str).toLowerCase() »
  .replace(/[<>]+/g, '');
  var q_folded = accent_fold(q).toLowerCase() »
  .replace(/[<>]+/g, '');  // Create an intermediate string with hilite hints
  // Example: fulani<lo> <lo>pez
  var re = new RegExp(q_folded, 'g');
  var hilite_hints = str_folded.replace(re, »
  '<'+q_folded+'>');  // Index pointer for the original string
  var spos = 0;
  // Accumulator for our final string
  var highlighted = '';  // Walk down the original string and the hilite hint
  // string in parallel. When you encounter a < or > hint,
  // append the opening / closing tag in our final string.
  // If the current char is not a hint, append the equiv.
  // char from the original string to our final string and
  // advance the original string's pointer.
  for (var i = 0; i< hilite_hints.length; i++) {
    var c = str.charAt(spos);
    var h = hilite_hints.charAt(i);
    if (h === '<') {
      highlighted += '<b>';
    } else if (h === '>') {
      highlighted += '</b>';
    } else {
      spos += 1;
      highlighted += c;
    }
  }
  return highlighted;
}

The previous example is probably too simplistic for production code. You can’t highlight multiple terms, for example. Some special characters might expand to two characters, such as “æ” –> “ae” which will screw up spos. It also strips out angle-brackets (<>) in the original string. But it’s good enough for a first pass.

Accent-folding in YUI Autocomplete#section4

YUI’s Autocomplete library has many hooks and options to play with. Today we’ll look at two overrideable methods: filterResults() and formatMatch(). The filterResults method allows you to write your own matching function. The formatMatch method allows you to change the HTML of an entry in the list of suggested matches. You can also download a complete, working example with all of the source code.

(Line wraps marked » —Ed.)

<b><!-- this is important to tell javascript to treat
the strings as UTF-8 -->
<meta http-equiv="content-type" content="text/html;charset=utf-8" /><!-- YUI stylesheets -->
<link rel="stylesheet" type="text/css" 
 href="http://yui.yahooapis.com/2.7.0/build/fonts/fonts-min.css" />
<link rel="stylesheet" type="text/css" 
 href="http://yui.yahooapis.com/2.7.0/build/autocomplete/assets »
 /skins/sam/autocomplete.css" /><!-- YUI libraries: events, datasource and autocomplete -->
<script type="text/javascript" 
 src="http://yui.yahooapis.com/2.7.0/build/yahoo-dom-event »
 /yahoo-dom-event.js"></script>
<script type="text/javascript" 
 src="http://yui.yahooapis.com/2.7.0/build/datasource »
 /datasource-min.js"></script>
<script type="text/javascript" 
 src="http://yui.yahooapis.com/2.7.0/build/autocomplete »
 /autocomplete-min.js"></script><!-- contains accent_fold() and accent_folded_hilite() -->
<script type="text/javascript" src="accent-fold.js"></script><!-- Give <body> the YUI "skin" -->
<body class="yui-skin-sam">
  <b>To:</b> 
  <div style="width:25em">    <!-- Our to: field -->
    <input id="to" type="text" />    <!-- An empty <div> to contain the autocomplete -->
    <div id="ac"></div>  </div>
</body><script>// Our static address book as a list of hash tables
var addressBook = [
  {name:'Fulanito López', email:'fulanito@example.com'},
  {name:'Erik Lørgensen', email:'erik@example.com'},
  {name:'Lorena Smith',   email:'lorena@example.com'},
  {name:'James Lö',       email:'james@example.com'}
];/* 
Iterate our address book and add a new field to each 
row called "search." This contains an accent-folded 
version of the "name" field.
*/
for (var i = 0; i< addressBook.length; i++) {
  addressBook[i]['search'] = accent_fold(addressBook[i]['name']);
}// Create a YUI datasource object from our raw address book
var datasource = new YAHOO.util.LocalDataSource(addressBook);/*
A datasource is tabular, but our array of hash tables has no
concept of column order. So explicitly tell the datasource 
what order to put the columns in.
*/
datasource.responseSchema = {fields : ["email", "name", "search"]};/*
Instantiate the autocomplete widget with a reference to the 
input field, the empty div, and the datasource object.
*/
var autocomp = new YAHOO.widget.AutoComplete("to", "ac", datasource);// Allow multiple entries by specifying space
// and comma as delimiters
autocomp.delimChar = [","," "];/* 
Add a new filterResults() method to the autocomplete object:
Iterate over the datasource and search for q inside the 
"search" field. This method is called each time the user
types a new character into the input field.
*/
autocomp.filterResults = function(q, entries, resultObj, cb) {
    var matches = [];
    var re = new RegExp('\\b'+accent_fold(q), 'i');
    for (var i = 0; i < entries.length; i++) {
        if (re.test(entries[i]['search'])) {
            matches.push(entries[i]);
        }
    }
    resultObj.results = matches;
    return resultObj;
};/*
Add a new formatResult() method. It is called on each result
returned from filterResults(). It outputs a pretty HTML 
representation of the match. In this method we run the
accent-folded highlight function over the name and email.
*/
autocomp.formatResult = function (entry, q, match) {
    var name = accent_folded_hilite(entry[1], q);
    var email = accent_folded_hilite(entry[0], q);
    return name + ' <' + email + '>';
};//fin
</script>

About those million caveats…#section5

This accent-folding trick works primarily for Western European text, but it won’t work for all of it. It exploits specific quirks of the language family and the limited problem domains of our examples, where it’s better to get more results than no results. In German, Ü should probably map to Ue instead of just U. A French person searching the web for thé (tea) would be upset if flooded with irrelevant English text.

You can only push a simple character map so far. It would be very tricky to reconcile the Roman “Marc Chagall” with the Cyrillic “Марк Шагал” or Yiddish “מאַרק שאַגאַל.” There are very interesting similarities in the characters but a magical context-free two-way mapping is probably not possible.

On top of all that there is another problem: One language can have more than one writing system. Transliteration is writing a language in a different alphabet. It’s not quite the same as transcription, which maps sounds as in “hola, que paso” –> “oh-la, keh pah-so.” Transliterations try to map the written symbols to another alphabet, ideally in a way that’s reversible.

These four sentences all say “Children like to watch television” in Japanese:

  • Kanji: 子供はテレビを見るのが好きです。
  • Hiragana: こども は てれび を みる の が すき です 。
  • Romaji: kodomo wa terebi o miru noga suki desu.
  • Cyrillic: кодомо ва тэрэби о миру нога суки дэсу.

For centuries people have been inventing ways to write different languages with whatever keyboards or typesetting they had available. So even if the user reads only one language, they might do so in multiple transliteration schemes. Some schemes are logical and academic, but often they are messy organic things that depend on regional accent and historical cruft. The computer era kicked off a new explosion of systems as people learned to chat and send emails in plain ASCII.

There is a lot of prior work on this problem and two ready paths you can choose: The right way and the maybe-good-enough way. Neither have the simplicity of our naïve hash table, but they will be less disappointing for your users in general applications.

The first is International Components for Unicode (ICU), a project that originated in the early nineties in the European Union. It aims to be a complete, language-aware transliteration, Unicode, formatting, everything library. It’s big, it’s C++/Java, and it requires contextual knowledge of the inputs and outputs to work.

The second is Unidecode, a context-free transliteration library available for Perl and Python. It tries to transliterate all Unicode characters to a basic Latin alphabet. It makes little attempt to be reversible, language-specific, or even generally correct; it’s a quick-and-dirty hack that nonetheless is pretty comprehensive.

Accent-folding in the right places saves your users time and makes your software smarter. The strict segregation of languages and locales is partially an artifact of technology limitations which no longer hold. It’s up to you how far you want to take things, but even a little bit of effort goes a long way. Good luck!

About the Author

Carlos Bueno

Carlos Bueno is a software engineer with Xoopit / Photos for Yahoo! Mail. He writes occasionally about overlooked aspects of internationalization, performance, and security. He lives in San Francisco.

23 Reader Comments

  1. It seems like facebook is actually implementing this in their search function, although I hadn’t noticed it before.

    If i try to find my friend named “Ã…sa” I have to write her full (albeit short) first name or else I won’t find her, the list populates with all names that begin with “A”. This is not a good way to solve the “problem” (exactly what is the problem anyway?).

    Where would you need to implement a solution such as this? Won’t the application just become less international/multi-lingual?

    Interesting concept nonetheless.

  2. I think it is important for anyone wanting to implement this, that you remember to sort by relevance. What I mean by that is, if you search for “Jø”, you list all results with the exact unicode match FIRST. Anything else should come after that (results matching “jo” for example).

    Otherwise it is a very interesting approach to a common problem.

  3. The commonly accepted solution to “accent folding” in Unicode is called stringprep and is documented in RFC-3454. It handles accents, uppercase/lowercase, and many other nasty details of various character sets.

    Various open standards make use of stringprep: SASL, XMPP, IDN, …

    I don’t know of any JavaScript stringprep implementations but creating one shouldn’t prove too difficult. Examples exist in many other languages.

  4. Isn’t what you call accent folding (first time I hear that expression) what is referred to as Unicode normalization?

  5. @cbas, @Ned: This technique has the same goal of Unicode Normalization, but is not anywhere near as correct. 😀 It has the virtue of being fairly easy to understand and implement.

    You are right though — I should have talked about normalization a bit and pointed to some of the libraries like Python’s unicodedata:

    http://docs.python.org/library/unicodedata.html

  6. I think it is worth pointing out that “accented characters” are not, in fact, only guides to pronunciation (or ornamentations for some other reason). Some of them are actually characters in their own right in one language or another.

    Therefore any technique like this is really only applicable to English and possible a few other languages unless the code gets a lot more complicated.

    I can only speak with any degree of knowledge about my own native language, Swedish, where our accented åäö are not a:s and o:s in the same way as an é is. Our alphabet does not end on z. It goes on to include these three extra characters.

    We have special keys on our keyboards for them and have generally no desire to have a and ä treated the same. We do however not have an é key and would probably be thankful if that character WAS treated equal to an e. Multiply this by the number of countries typing on latin keyboards and… See the complexity looming here? Each language and/or country is likely to need their own set of rules as to which characters to “fold” and not.

    Any application expecting multiple languages should take these things seriously… even it that turns out to mostly be Facebook, Twitter and Google. 🙂

  7. Teebz: That’s a tricky one. A good compromise would be to place exact matches above accent-folded / normalized ones.

    Martin: You are of course correct. A real system should probably take into account the user’s locale, but paying attention to this complicates the implementation enormously.

  8. I am really not hugely convinced by the article’s blandishments that diacritics are generally disposable marks, except in rare cases when they aren’t. This seems like a rearticulation of the U.S. English model that real people don’t need accents and whatever minority of people who do are weird and expendable.

    The case explored in the article — searching for a proper name that may or may not use diacritics with a keyboard that may or may not have them — is surely frequently used, but deceptive. The almost offhand mention of searching for _thé_ rather than _the_ seems more central.

    Just using the same French there, it is shockingly untrue to state that an accent merely denotes pronunciation, since past tenses of verbs (danser → dansé; créer → créée [f.]) and homograph pairs (sucre:sucré; mais:maïs; de:dé; du:dû; a:à) are the norm, not exceptions.

    I’d say the actual default case is that diacritical marks are separate letters. Only in edge cases should they not be treated that way. This is the _opposite_ of the article’s focus and of U.S. English default expectations. The article’s methods for treating that edge case are not necessarily wrong, though the emphasis strikes me as wrong or inverted.

    Jukka Korpela goes into some detail on this in _Unicode Explained_, readable on Google Books.

  9. Kudos to ALA for broaching this i18n-related topic, here’s hoping a lot more appear.

    It’s worth pointing out that normalization is actually not the same thing as diacritic folding, although it’s relevant. Check out the following for more detail:

    http://en.wikipedia.org/wiki/Unicode_equivalence

    Actually work on the details of this stuff is ongoing:

    http://www.w3.org/TR/charmod-norm/ ← Pretty gruesome, this.

    There’s also some Javascript (and PHP) code at i18n guru Richard Ishida’s site:

    http://rishida.net/blog/?p=222

  10. For what its worth, I believe your hiragana example would be better presented without the spacing between “words”. When written normally it all runs together like so:

    こどもはてれびをみるのがすきです

    rather than:

    こども は てれび を みる の が すき です

  11. I think doing things as described could actually cause more problems then it solves. If this where to be done, I think it would have to be language specific or your heading for disaster.

    In a language learning tool I am currently working on the approach I have taken is that if any accented character is entered to leave the string completely alone, unless I can’t turn up any results. If there are no accented characters I then play smart and try to see if I can pull up words both with and without the accents using a technique similar to that described here, expect limited to those characters that are valid for the given language.

  12. While I generally agree with a lot of things you say, I have a few remarks concerning the way Wikipedia handles this issue.

    You wrote:
    * Wikipedia: Ryszard Kapuściński (canonical URL)
    * Wikipedia: Ryszard Kapuscinski (hand-coded alternate)
    * Wikipedia: Ryszard Kapusciński (not found)
    * Wikipedia: Rÿszarḋ Kåpuścińsḳi (not found)

    Wikipedia articles have a unique name (what you call “canonical URL”) and multiple “redirects” (what you call “hand-coded alternate”), based on possible common misspellings. When you search “Ryszard KapusciÅ„ski”, you are not automatically redirected to the canonical URL, but the canonical article does appear as the first search result: http://en.wikipedia.org/w/index.php?title=Special:Search&search=Ryszard+KapusciÅ„ski&go=Go

    I think the reason for not redirecting automatically is that there may very well exist a “Ryszard KapusciÅ„ski” or a “Rÿszarḋ KÃ¥puÅ›ciÅ„sḳi”, different from “Ryszard KapuÅ›ciÅ„ski”, and the user needs to be able to create a new article about them without being automatically redirected by the software.

    (By the way, the fourth link labeled “Wikipedia” links to Spock).

  13. This piece of code is great and especially useful when dealing with person names.
    However, this may lead to bad interpretation, loss of semantic sense if used on text longer than just surname/name.

  14. herefore any technique like this is really only applicable to English and possible a few other languages unless the code gets a lot more complicated.

    TOEFL english
    CELTA certification
    esl tutoring toronto
    International House Toronto took its first registration in 1996 and has hosted over 2000 students from around the world. The school’s founding philosophy was to offer high quality English programs in a warm and comfortable atmosphere for our traveling students. Although we have grown in size, our philosophy has stayed the same.

  15. I must admit this article makes me cringe. It is important when comparing texts to account for whether various characters and character sequences should be considered identical. I am glad that this article highlights this and give some examples of its importance.

    However, the area is more complex than discussed in the article and the suggestions made are in some cases dangerously wrong. The discussion of normalization is important since several sequences of character are identical alternative representations or may be depending on the rules one is following. Additionally, after this there is the question of comparison where, as mentioned, certain sequences are considered equivalent. However, these equivalence classes depend on the locale of the user, and the locale of the text data, also and desired usage (for example whether case insensitive).

    These definitions of equivalence are collation sequences. When ever comparing text one should not use a simple “string == string” idiom but something along the lines of “currentCollator = I18nLibrary.GetCollation; currentCollator.Compare(string, string);”.

    I would urge people to look at the Java documentation for java.text.Collator since that is one of the nicer starting reference. I do not know of a javascript implementation.

Got something to say?

We have turned off comments, but you can see what folks had to say before we did so.

More from ALA

Nothing Fails Like Success

Our own @zeldman paints the complicated catch-22 that our free, democratized web has with our money-making capitalist roots. As creators, how do we untangle this web? #LetsFixThis