Google Maps is a great service for developers and end users. There is just one thing that bothers me when I use Google Maps: try searching for cities like Cpenhagen or Mnchester. No results, no helpful suggestions for correction during typing at all. Look at those cities’ names again. There is only one character omitted and Google Maps leaves user with no help whatsoever. How unforgiving.
Typing errors are not rare. Even the professional typists’ error rate is 1%. And this is per-keystroke not per typed word. Typing error rate for Google Maps users is not published but I’d wager that it is much higher – maybe even at 5% per-keystroke range.
To improve usability of Google Maps I’d wish an autocomplete function that is:
- robust: typing errors in the user’s input are anticipated and tolerated;
- helpful to the user: with high probability, suggested corrections contains intended input; other suggestions provided are also highly probable given user’s input;
- efficient: the number of cities is in the order of millions and autocomplete must return results in real time.
This is what I built using a publicly available database of 2.7 million cities names. Note that some city names in the database are translated to English (Москва -> Moscow), transliterated (Coruña -> Coruna), or transcribed ( جدة -> Jeddah). (Mis)type a city name (e.g. “cpenh”) in the input box below and compare the suggestions to Google Maps’. Google Maps suggestions may include street and company names not available in our
Besides Google Maps which deals with a large set of cities autocomplete might be helpful to other websites that also require users to search or select from a set of options with non-trivial size. Usability of lookups can be improved using an autocomplete function that is robust to typing errors.
A bit of theory
The autocomplete function takes two inputs: user query q which is simply a sequence of keystrokes typed by user and a dictionary W of words representing valid suggestions (autocomplete options).
In our case dictionary W consists of names of cities and q is the user’s input to Google Maps search input form.
The probability of a suggestion w from W given user query q (denoted as P(w | q)) can be calculated using Bayes theorem:
and the autocomplete function should suggest the most probable word after the user typed query q:
as P(q) is constant for all candidate words w from W the expression for finding most probable suggestion for user typed query q simplifies to
So, word w is selected as a suggestion given user query q if the product of the two factors above is maximal among words the dictionary W.
P(w) is called prior probability and denotes the probability that suggestion should be w even before we know what the entered query q was. The prior probability of w can be estimated from the frequency of searches on Google Maps. As this information was not available to me I needed to use something else instead. Probing various cities popularity in Google Trends shows that larger cities are usually searched for more often than smaller cities. Therefore, assigned prior probability P(w) that city w will be suggested is set to be proportional to the logarithm of city size.
P(q | w), the other factor which determines the probability of choosing w as a suggestion denotes the probability that user types query q when she is searching for city w. P(‘Paris’ | ‘Paris’) denotes the probability of no typing errors when user searches for Paris. This probability is high for good typist. However, other queries like ”Pais’ are also probable: P(‘Pais’ | ‘Paris’) is high since query ‘Pais’ only differs from suggestion ‘Paris’ for one missing character. The situation is similar in the case of P(‘Pars’ | ‘Paris’), P(‘Prais’ | ‘Paris’), and many other mistyped queries similar to Paris.
Estimation of P(q | w) is based on the notion of the similarity of q and w. The more similar query q and candidate suggestion w are the higher the probability P(q | w).
We can transform query q in a candidate suggestion w in a series of one character transformations: insertion of a character, deletion of a character, substitution of a character, and transposition of two adjacent characters. Query ‘Pasi’ can be transformed into Paris by insertion of the character ‘r’ in the third position of the query followed by the transposition of characters ‘s’ and ‘i’. All transformations are not equally probable (e.g. near misses of the keyboard keys are more likely that far misses; transpositions are usually at the medial characters etc.). If we weight each transformation on a path from q to w by the corresponding probability we can define P(q | w) as the product of the probabilities on the most probable path from q to w.
Brute force generation of candidate suggestions is not feasible as the size of the dictionary is large; in addition, there are many candidate paths from query q to a suggestion w.
Best-first search can be used for searching for the best suggestion. Suppose we store all words in the dictionary in a trie. The leaves of the trie represent dictionary words. We assign P(w) as the probability of a leaf w. Interior trie nodes represent dictionary words prefixes. The probability of an interior node of a trie which represents word prefix w* is set to
We start best-first search in state <q, w, 1.0>. The last parameter is the probability assigned to the state. In every state all single character transformations of q are generated. We can immediately eliminate all transformations that are not compatible with dictionary (e.g. if the dictionary contains color names and query starts with the character x we can deduce that transposition is not possible in the first query position as there is no color with character ‘x’ at the second position). Every valid transformation results in a new state <q’, w’, P (parent(q’, w’)) * P(<q, w> -> <q’, w’>) * P(w’)> where P(a->b) denotes the transformation probability from state a to b and P(w’) the prior probability of prefix w’. Search ends when state (”, ”, p* = P(q | w) P(w) ) is reached or we run out of candidates. Path from the root of the trie to the leaf represents the most probable suggestion w for query q.
After the first suggestion is found we can continue searching for the next suggestion w’. The search terminates when P(q | w’) P(w’) < *p/100. This is somehow arbitrary criterion is used to weed out non-probable suggestion alternatives.
Dictionary of cities consists of 2.7 million entries. Cities names are either transliterated in Latin characters (Coruña -> Coruna), translated to English (Москва -> Moscow), or transcribed ( جدة -> Jeddah).
In case of Google Maps it seems reasonable to assume that there are more queries against larger cities: searching for New York (pop. 8.2 mio) is probably much more common than searching for New Kent (pop. 18K). Despite exceptions to this rule (e.g. tourist destinations attracting more searches compared to their population size), assigning prior probability of a city suggestion proportional to the city population seems a reasonable choice. Formally:
Population information for many cities in the database are missing. In such a case population is set to be 10, resulting in factor 1 in the numerator.
Typing errors have been extensively studied. Most typing errors tend to be simple insertions, substitutions and transpositions of characters. Typing errors are less likely at the beginning of the string and transpositions usually occur in medial characters.
General per-keystroke probability of the error rate is set to 5%, further distributed over: insertion errors (60%), deletion errors (16%), substitution errors (16%), and transposition errors (6%). Substitution and insertion probability error mass is evenly distributed among all candidate transformations (e.g. if there are 10 possible insertions every insertion gets 1/10th of the insertion probability mass). Deletion error in the first query position is multiplicatively penalized with factor 0.05, at the second position with factor 0.1. Insertion and substitution at the first query position are multiplicatively penalized with factors .05 and .1, respectively.
Insertion and transposition probabilities take into account keyboard key distances since near key insertions and transpositions are more likely. Current keyboard distances computation is based on the south Slavic querty keyboard but can be easily adapted to other keyboard types.
Naive generation of all possible successor states can result in a large number of states explored. This can be alleviated by
- fixing order of generation of successor states,
- dividing successors into three sets (left candidates interval, best successor node, right candidates interval),
- setting interval probability as the best interval member probability,
- setting generation order in such a way that intervals are typically large (first evaluate insertions, then (possible) match, then substitutions, character deletion, and transposition).
Taking into account the size of the dictionary and the ambiguity of the query the number of expanded nodes is quite reasonable:
|Lis Agne||2,402||Los Angeles, Los Angelitos, Les Agnels, Les Agneliers, Les Agneliers Bas|
|nw yr||1,098||New York, Nwoya, Nweyon, Na Yan, Nwarem|
|cpenh||574||Copenhagen, Cepni, Spencer, Cienega, Cuenha|
Autocomplete is implemented in C++. Use it as follows
TAutocomplete ac; ac.load("cities.txt"); // load dictionary once vector suggestions; // autocomplete suggestions ac.autocomplete("cpenh", suggestions); // find suggestions for input // query "cpenh"; max number // of sugegstions (default 5) // can be set as the last // parameter of the method
Standalone C++ sources for autocomplete and Mongoose web server port is available at GitHub.