Accent Folding for Auto-Complete

by Carlos Bueno

23 Reader Comments

  1. It seems like Facebook is actually implementing this in their search function, although I hadn’t noticed it before.

    If I try to find my friend named “Åsa” I have to write her full (albeit short) first name or else I won’t find her; the list populates with all names that begin with “A”. This is not a good way to solve the “problem” (exactly what is the problem anyway?).

    Where would you need to implement a solution such as this? Won’t the application just become less international/multi-lingual?

    Interesting concept nonetheless.

  2. I think it is important for anyone wanting to implement this, that you remember to sort by relevance. What I mean by that is, if you search for “Jø”, you list all results with the exact unicode match FIRST. Anything else should come after that (results matching “jo” for example).

    Otherwise it is a very interesting approach to a common problem.
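
The ordering suggested in comment 2 can be sketched roughly as follows. This is only an illustration in Python, not the article's code: the `fold` and `rank_matches` helpers and the small `extra` table are assumptions made up for the example, and the table is far from exhaustive.

```python
import unicodedata
from typing import List


def fold(text: str) -> str:
    """Lowercase, strip combining marks, then map a few letters NFD leaves alone."""
    extra = {"ø": "o", "đ": "d", "ł": "l"}  # illustrative only, not exhaustive
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(extra.get(ch, ch) for ch in stripped)


def rank_matches(query: str, names: List[str]) -> List[str]:
    """Return prefix matches, accent-sensitive matches before folded ones."""
    q_exact = unicodedata.normalize("NFC", query.lower())
    q_folded = fold(query)
    exact, folded = [], []
    for name in names:
        if unicodedata.normalize("NFC", name.lower()).startswith(q_exact):
            exact.append(name)      # matches the query as typed
        elif fold(name).startswith(q_folded):
            folded.append(name)     # matches only after folding
    return exact + folded


names = ["John", "Jørgen", "Jonas", "Maria"]
print(rank_matches("Jø", names))  # ['Jørgen', 'John', 'Jonas']
```

Within each group the original list order is kept, so any existing relevance ordering survives.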

  3. accent_map should be accentMap

  4. The commonly accepted solution to “accent folding” in Unicode is called stringprep and is documented in RFC-3454. It handles accents, uppercase/lowercase, and many other nasty details of various character sets.

    Various open standards make use of stringprep: SASL, XMPP, IDN, …

    I don’t know of any JavaScript stringprep implementations but creating one shouldn’t prove too difficult. Examples exist in many other languages.

  5. Isn’t what you call accent folding (first time I hear that expression) what is referred to as Unicode normalization?

  6. cbas, Ned: This technique has the same goal as Unicode Normalization, but is not anywhere near as correct. :D It has the virtue of being fairly easy to understand and implement.

    You are right though—I should have talked about normalization a bit and pointed to some of the libraries like Python’s unicodedata:

    http://docs.python.org/library/unicodedata.html
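
For readers curious what the `unicodedata` route looks like, here is a minimal sketch. It assumes the usual approach of decomposing to NFD and dropping combining marks; `strip_marks` is a name chosen for this example, not something from the article.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("José Åsa Kapuściński"))  # Jose Asa Kapuscinski
print(strip_marks("Jørgen"))                # Jørgen (ø has no decomposition)
```

Letters such as ø or ł are single code points with no canonical decomposition, so this approach leaves them untouched; a hand-written map like the article's is still needed for those.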

  7. I think it is worth pointing out that “accented characters” are not, in fact, only guides to pronunciation (or ornamentations for some other reason). Some of them are actually characters in their own right in one language or another.

    Therefore any technique like this is really only applicable to English and possibly a few other languages unless the code gets a lot more complicated.

    I can only speak with any degree of knowledge about my own native language, Swedish, where our accented åäö are not a:s and o:s in the same way as an é is. Our alphabet does not end on z. It goes on to include these three extra characters.

    We have special keys on our keyboards for them and have generally no desire to have a and ä treated the same. We do however not have an é key and would probably be thankful if that character WAS treated equal to an e. Multiply this by the number of countries typing on latin keyboards and… See the complexity looming here? Each language and/or country is likely to need their own set of rules as to which characters to “fold” and not.

    Any application expecting multiple languages should take these things seriously… even if that turns out to mostly be Facebook, Twitter and Google. :)
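
One way to picture the per-language rules comment 7 asks for is a table of letters that must never be folded. The sketch below is hypothetical: the `PROTECTED` table and `fold_for_locale` are invented for illustration, and only the Swedish entry (taken from the comment) is filled in.

```python
import unicodedata

# Hypothetical table: letters that are letters in their own right per language.
# Only Swedish is filled in, following the comment above.
PROTECTED = {
    "sv": set("åäöÅÄÖ"),
}


def fold_for_locale(text: str, locale: str) -> str:
    """Strip accents except on letters the given language treats as distinct."""
    keep = PROTECTED.get(locale, set())
    out = []
    # NFC first, so that "a" + combining ring is seen as the single letter "å".
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(fold_for_locale("Café Åsa", "sv"))  # Cafe Åsa
print(fold_for_locale("Café Åsa", "en"))  # Cafe Asa
```

A real system would need such rules for every supported language, which is exactly the complexity the comment warns about.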

  8. If you’d like to explore a functioning version of the simple example in Carlos’s article, I’ve put one up here: http://ericmiraglia.com/yui/demos/accentfolding.php

  9. Teebz: That’s a tricky one. A good compromise would be to place exact matches above accent-folded / normalized ones.

    Martin: You are of course correct.  A real system should probably take into account the user’s locale, but paying attention to this complicates the implementation enormously.

  10. There’s a lot more on unicode normalization here: http://unicode.org/reports/tr15/

    And it’s also worth mentioning PHP has a Normalizer class as part of the intl extension (built into 5.3, or available as PECL extension for >=5.2.4 http://pecl.php.net/package/intl).

    http://www.php.net/manual/en/class.normalizer.php (the NFKC form is probably most relevant to accent folding)
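
To make the normalization forms concrete, here is a small Python sketch using the standard `unicodedata` module (the `forms` helper is just a name for this example). It shows that the forms unify different encodings of the same character, and that the compatibility forms (NFKC, NFKD) also unfold things like ligatures, but none of them removes an accent by itself; stripping marks is a separate step, typically done after NFD or NFKD.

```python
import unicodedata


def forms(text: str) -> dict:
    """Return the text in each of the four Unicode normalization forms."""
    return {name: unicodedata.normalize(name, text)
            for name in ("NFC", "NFD", "NFKC", "NFKD")}


composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                  # False: different code points
print(forms(decomposed)["NFC"] == composed)    # True: NFC composes
print(forms(composed)["NFD"] == decomposed)    # True: NFD decomposes
print(forms("\ufb01")["NFKC"])                 # fi (the ligature is unfolded)
print(forms(composed)["NFKC"] == composed)     # True: the accent is still there
```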

  11. I am really not hugely convinced by the article’s blandishments that diacritics are generally disposable marks, except in rare cases when they aren’t. This seems like a rearticulation of the U.S. English model that real people don’t need accents and whatever minority of people who do are weird and expendable.

    The case explored in the article — searching for a proper name that may or may not use diacritics with a keyboard that may or may not have them — is surely frequently used, but deceptive. The almost offhand mention of searching for thé rather than the seems more central.

    Just using the same French there, it is shockingly untrue to state that an accent merely denotes pronunciation, since past tenses of verbs (danser → dansé; créer → créée [f.]) and homograph pairs (sucre:sucré; mais:maïs; de:dé; du:dû; a:à) are the norm, not exceptions.

    I’d say the actual default case is that diacritical marks are separate letters. Only in edge cases should they not be treated that way. This is the opposite of the article’s focus and of U.S. English default expectations. The article’s methods for treating that edge case are not necessarily wrong, though the emphasis strikes me as wrong or inverted.

    Jukka Korpela goes into some detail on this in Unicode Explained, readable on Google Books.

  12. Kudos to ALA for broaching this i18n-related topic, here’s hoping a lot more appear.

    It’s worth pointing out that normalization is actually not the same thing as diacritic folding, although it’s relevant. Check out the following for more detail:

    http://en.wikipedia.org/wiki/Unicode_equivalence

    Actually, work on the details of this stuff is ongoing:

    http://www.w3.org/TR/charmod-norm/ ← Pretty gruesome, this.

    There’s also some Javascript (and PHP) code at i18n guru Richard Ishida’s site:

    http://rishida.net/blog/?p=222

  13. My co-worker Nikolay Bachiyski shared the following with me when I brought his attention to this excellent article.

    “This is also known as Unicode Normalization or Equivalence [1]. And there are good APIs for both Java [2] and PHP [3] (via an extension)

    1. http://en.wikipedia.org/wiki/Unicode_equivalence
    2. http://java.sun.com/javase/6/docs/api/java/text/Normalizer.html
    3. http://php.net/manual/en/class.normalizer.php”

  14. For what it’s worth, I believe your hiragana example would be better presented without the spacing between “words”. When written normally it all runs together like so:

    こどもはてれびをみるのがすきです

    rather than:

    こども は てれび を みる の が すき です

  15. Good post… thanks for sharing. Very useful for me; I will bookmark this for future use. Thanks for a great source.

  16. I think doing things as described could actually cause more problems than it solves. If this were to be done, I think it would have to be language-specific, or you’re heading for disaster.

    In a language learning tool I am currently working on, the approach I have taken is this: if any accented character is entered, I leave the string completely alone, unless I can’t turn up any results. If there are no accented characters, I then play smart and try to pull up words both with and without the accents, using a technique similar to that described here, except limited to those characters that are valid for the given language.
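
The two-stage lookup described in comment 16 might look something like the sketch below. It is a guess at the idea, not the commenter's code: `lookup`, `fold_limited`, the `SPANISH` set and the tiny word list are all assumptions made for the example.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def has_marks(text):
    """True if the text contains any accent that strip_marks would remove."""
    return strip_marks(text) != unicodedata.normalize("NFD", text)


def fold_limited(text, foldable):
    """Strip marks only from letters listed as valid for the language."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(strip_marks(c) if c in foldable else c for c in composed)


def lookup(query, words, foldable):
    query = unicodedata.normalize("NFC", query.lower())
    if has_marks(query):
        # Accents were typed on purpose: trust them unless nothing matches.
        exact = [w for w in words if unicodedata.normalize("NFC", w.lower()) == query]
        if exact:
            return exact
    # No accents typed (or no exact hit): match with and without accents.
    plain = strip_marks(query)
    return [w for w in words if fold_limited(w.lower(), foldable) == plain]


SPANISH = set("áéíóúüñ")          # hypothetical per-language set
WORDS = ["el", "él", "año", "ano"]

print(lookup("él", WORDS, SPANISH))  # ['él']
print(lookup("el", WORDS, SPANISH))  # ['el', 'él']
```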

  17. While I generally agree with a lot of things you say, I have a few remarks concerning the way Wikipedia handles this issue.

    You wrote:

    • Wikipedia: Ryszard Kapuściński (canonical URL)
    • Wikipedia: Ryszard Kapuscinski (hand-coded alternate)
    • Wikipedia: Ryszard Kapusciński (not found)
    • Wikipedia: Rÿszarḋ Kåpuścińsḳi (not found)

    Wikipedia articles have a unique name (what you call “canonical URL”) and multiple “redirects” (what you call “hand-coded alternate”), based on possible common misspellings. When you search “Ryszard Kapusciński”, you are not automatically redirected to the canonical URL, but the canonical article does appear as the first search result: http://en.wikipedia.org/w/index.php?title=Special:Search&search=Ryszard+Kapusciński&go=Go

    I think the reason for not redirecting automatically is that there may very well exist a “Ryszard Kapusciński” or a “Rÿszarḋ Kåpuścińsḳi”, different from “Ryszard Kapuściński”, and the user needs to be able to create a new article about them without being automatically redirected by the software.

    (By the way, the fourth link labeled “Wikipedia” links to Spock).

  18. A real system should probably take into account the user’s locale, but paying attention to this complicates the implementation enormously.


  19. This piece of code is great and especially useful when dealing with person names.
    However, this may lead to bad interpretation, loss of semantic sense if used on text longer than just surname/name.

  20. Your example of other RTL languages should be called Hebrew and not Yiddish.

    “Yiddish is a dialect of High German including some Hebrew and other words; spoken in Europe as a vernacular by many Jews; written in the Hebrew script” [source: http://wordnetweb.princeton.edu/perl/webwn?s=yiddish]

  21. Therefore any technique like this is really only applicable to English and possibly a few other languages unless the code gets a lot more complicated.


  22. Right, Facebook seems to be trying to implement this in its search function.

  23. I must admit this article makes me cringe. It is important when comparing texts to account for whether various characters and character sequences should be considered identical. I am glad that this article highlights this and gives some examples of its importance.

    However, the area is more complex than discussed in the article, and the suggestions made are in some cases dangerously wrong. The discussion of normalization is important, since several sequences of characters are identical alternative representations, or may be, depending on the rules one is following. Beyond that there is the question of comparison, where, as mentioned, certain sequences are considered equivalent. However, these equivalence classes depend on the locale of the user, the locale of the text data, and the desired usage (for example, whether the comparison is case-insensitive).

    These definitions of equivalence are collation sequences. Whenever comparing text, one should not use a simple “string == string” idiom but something along the lines of “currentCollator = I18nLibrary.GetCollation; currentCollator.Compare(string, string);”.

    I would urge people to look at the Java documentation for java.text.Collator, since that is one of the nicer starting references. I do not know of a JavaScript implementation.
