Improved autocomplete function for Google Maps


The search autocomplete function implemented here handles user typing errors gracefully.
See how it compares to Google Maps autocomplete.
If you find it useful, grab the sources on GitHub.

Google Maps is a great service for developers and end users. There is just one thing that bothers me when I use Google Maps: try searching for cities like Cpenhagen or Mnchester. No results, no helpful suggestions for correction during typing at all. Look at those cities’ names again. Only one character is omitted, yet Google Maps leaves the user with no help whatsoever. How unforgiving.

Typing errors are not rare. Even professional typists have an error rate of about 1%, and that is per keystroke, not per typed word. The typing error rate of Google Maps users is not published, but I’d wager that it is much higher, maybe even in the 5% per-keystroke range.

To improve the usability of Google Maps I wish for an autocomplete function that is:

  • robust: typing errors in the user’s input are anticipated and tolerated;
  • helpful to the user: with high probability, the suggested corrections contain the intended input; other suggestions provided are also highly probable given the user’s input;
  • efficient: the number of cities is on the order of millions and autocomplete must return results in real time.


This is what I built using a publicly available database of 2.7 million city names. Note that some city names in the database are translated to English (Москва -> Moscow), transliterated (Coruña -> Coruna), or transcribed (جدة -> Jeddah). (Mis)type a city name (e.g. “cpenh”) in the input box below and compare the suggestions to Google Maps’. Google Maps suggestions may include street and company names that are not available in our database of cities.

Besides Google Maps, which deals with a large set of cities, such autocomplete might be helpful on other websites that require users to search or select from a non-trivially sized set of options. The usability of these lookups can be improved with an autocomplete function that is robust to typing errors.

A bit of theory

The autocomplete function takes two inputs: a user query q, which is simply the sequence of keystrokes typed by the user, and a dictionary W of words representing valid suggestions (autocomplete options).

In our case the dictionary W consists of names of cities and q is the user’s input to the Google Maps search form.

The probability of a suggestion w from W given user query q (denoted as P(w | q)) can be calculated using Bayes’ theorem:

$$P(w \mid q) = \frac{P(q \mid w)\, P(w)}{P(q)}$$
and the autocomplete function should suggest the most probable word after the user typed query q:
$$w^* = \underset{w \in W}{\arg\max}\ P(w \mid q)$$
As P(q) is constant for all candidate words w from W, the expression for finding the most probable suggestion for the user-typed query q simplifies to
$$w^* = \underset{w \in W}{\arg\max}\ P(q \mid w)\, P(w)$$
So, word w is selected as a suggestion given user query q if the product of the two factors above is maximal among all words in the dictionary W.

P(w) is called the prior probability and denotes the probability that the suggestion should be w even before we know what the entered query q was. The prior probability of w could be estimated from the frequency of searches on Google Maps. As this information was not available to me, I needed to use something else instead. Probing the popularity of various cities in Google Trends shows that larger cities are usually searched for more often than smaller ones. Therefore, the prior probability P(w) that city w will be suggested is set to be proportional to the logarithm of the city’s population.

P(q | w), the other factor determining the probability of choosing w as a suggestion, denotes the probability that the user types query q when she is searching for city w. P(‘Paris’ | ‘Paris’) denotes the probability of no typing errors when the user searches for Paris. This probability is high for a good typist. However, other queries like ‘Pais’ are also probable: P(‘Pais’ | ‘Paris’) is high since the query ‘Pais’ differs from the suggestion ‘Paris’ by only one missing character. The situation is similar for P(‘Pars’ | ‘Paris’), P(‘Prais’ | ‘Paris’), and many other mistyped queries similar to Paris.

Estimation of P(q | w) is based on the notion of similarity between q and w. The more similar the query q and the candidate suggestion w are, the higher the probability P(q | w).
We can transform query q into a candidate suggestion w by a series of single-character transformations: insertion of a character, deletion of a character, substitution of a character, and transposition of two adjacent characters. The query ‘Pasi’ can be transformed into ‘Paris’ by insertion of the character ‘r’ at the third position of the query followed by the transposition of the characters ‘s’ and ‘i’. Not all transformations are equally probable (e.g. near misses of keyboard keys are more likely than far misses; transpositions usually occur at medial characters, etc.). If we weight each transformation on a path from q to w by the corresponding probability, we can define P(q | w) as the product of the probabilities on the most probable path from q to w.
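
As a minimal illustration of this idea, the sketch below computes P(q | w) for a single query and a candidate word as the maximum-probability path over insertions, deletions, substitutions, and adjacent transpositions (a max-product variant of the Damerau-Levenshtein dynamic program). The per-operation constants are hypothetical placeholders; the position and keyboard-distance weighting described under “Implementation details” is omitted.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical per-operation probabilities: roughly a 95% chance of a correct
// keystroke, the remaining 5% split among the error types.
const double P_MATCH = 0.95, P_INS = 0.03, P_DEL = 0.008,
             P_SUB = 0.008, P_TRA = 0.003;

// P(q | w): probability of the most probable sequence of single-character
// transformations turning the intended word w into the typed query q.
double queryProb(const std::string& q, const std::string& w) {
    const size_t n = q.size(), m = w.size();
    std::vector<std::vector<double>> d(n + 1, std::vector<double>(m + 1, 0.0));
    d[0][0] = 1.0;
    for (size_t i = 0; i <= n; ++i) {
        for (size_t j = 0; j <= m; ++j) {
            if (i && j) {  // correct keystroke or substitution
                double p = (q[i - 1] == w[j - 1]) ? P_MATCH : P_SUB;
                d[i][j] = std::max(d[i][j], d[i - 1][j - 1] * p);
            }
            if (i)  // insertion: an extra character was typed
                d[i][j] = std::max(d[i][j], d[i - 1][j] * P_INS);
            if (j)  // deletion: a character of w was omitted
                d[i][j] = std::max(d[i][j], d[i][j - 1] * P_DEL);
            if (i >= 2 && j >= 2 && q[i - 1] == w[j - 2] && q[i - 2] == w[j - 1])
                d[i][j] = std::max(d[i][j], d[i - 2][j - 2] * P_TRA);  // transposition
        }
    }
    return d[n][m];
}

int main() {
    std::printf("P('Pasi'  | 'Paris') = %g\n", queryProb("Pasi", "Paris"));
    std::printf("P('Paris' | 'Paris') = %g\n", queryProb("Paris", "Paris"));
}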

After defining P(w) and P(q | w) we need to find the suggestion w that maximizes the product of both factors over all candidate words in the dictionary W:
$$w^* = \underset{w \in W}{\arg\max}\ P(q \mid w)\, P(w)$$

Brute-force generation of candidate suggestions is not feasible as the size of the dictionary is large; in addition, there are many candidate paths from a query q to a suggestion w.

Best-first search can be used to find the best suggestion. Suppose we store all dictionary words in a trie. The leaves of the trie represent dictionary words. We assign P(w) as the probability of a leaf w. Interior trie nodes represent prefixes of dictionary words. The probability of an interior node of the trie which represents the word prefix w* is set to
$$P(w^*) = \max_{\substack{w \in W \\ w^*\ \text{a prefix of}\ w}} P(w)$$
We start the best-first search in the state <q, w, 1.0>. The last parameter is the probability assigned to the state. In every state, all single-character transformations of q are generated. We can immediately eliminate all transformations that are not compatible with the dictionary (e.g. if the dictionary contains color names and the query starts with the character ‘x’, we can deduce that a transposition is not possible at the first query position, as there is no color with the character ‘x’ at the second position). Every valid transformation results in a new state <q’, w’, P(parent(q’, w’)) · P(<q, w> -> <q’, w’>) · P(w’)>, where P(a -> b) denotes the transformation probability from state a to state b and P(w’) the prior probability of the prefix w’. The search ends when the state <’’, ’’, p* = P(q | w) · P(w)> is reached or we run out of candidates. The path from the root of the trie to the leaf reached represents the most probable suggestion w for the query q.
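
The sketch below is one simplified reading of this search on a toy dictionary, not the actual implementation: transpositions and keyboard-aware weighting are left out, the per-operation probabilities are the same hypothetical constants as before, and the priors are made up. Interior nodes are bounded by the best prior below them, so the first completed state popped from the priority queue is the most probable suggestion.

#include <algorithm>
#include <cstdio>
#include <map>
#include <memory>
#include <queue>
#include <string>

// Hypothetical per-operation probabilities (no transpositions in this sketch).
const double P_MATCH = 0.95, P_INS = 0.03, P_DEL = 0.008, P_SUB = 0.008;

// Trie over dictionary words; each node stores the best prior below it,
// used as the prefix probability that guides the search.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    double prior = 0.0;     // P(w) if this node ends a word
    double maxPrior = 0.0;  // max P(w) over all words below this node
    bool isWord = false;
};

void insertWord(TrieNode* node, const std::string& w, double prior) {
    node->maxPrior = std::max(node->maxPrior, prior);
    for (char c : w) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
        node->maxPrior = std::max(node->maxPrior, prior);
    }
    node->isWord = true;
    node->prior = prior;
}

// Search state: `prob` is an upper bound on P(q | w) * P(w) for any word w
// still reachable from `node`; for finished states it is the exact value.
struct State {
    double prob;
    size_t i;              // query characters consumed so far
    const TrieNode* node;  // trie node reached (prefix of the candidate word)
    std::string prefix;
    bool done;
    bool operator<(const State& o) const { return prob < o.prob; }
};

std::string bestSuggestion(const TrieNode& root, const std::string& q) {
    std::priority_queue<State> open;
    open.push({root.maxPrior, 0, &root, "", false});
    while (!open.empty()) {
        State s = open.top();
        open.pop();
        if (s.done) return s.prefix;              // first finished state is optimal
        double path = s.prob / s.node->maxPrior;  // probability of the path so far
        if (s.i == q.size() && s.node->isWord)    // whole query explained by a full word
            open.push({path * s.node->prior, s.i, s.node, s.prefix, true});
        if (s.i < q.size())                       // insertion: user typed an extra character
            open.push({path * P_INS * s.node->maxPrior, s.i + 1, s.node, s.prefix, false});
        for (const auto& [c, child] : s.node->children) {
            if (s.i < q.size())                   // correct keystroke or substitution
                open.push({path * (q[s.i] == c ? P_MATCH : P_SUB) * child->maxPrior,
                           s.i + 1, child.get(), s.prefix + c, false});
            // deletion: the user omitted the dictionary character c
            open.push({path * P_DEL * child->maxPrior, s.i, child.get(), s.prefix + c, false});
        }
    }
    return "";
}

int main() {
    TrieNode root;  // toy dictionary with made-up priors
    insertWord(&root, "paris", 0.5);
    insertWord(&root, "parma", 0.3);
    insertWord(&root, "pardubice", 0.2);
    std::printf("best suggestion for 'pasi': %s\n", bestSuggestion(root, "pasi").c_str());
}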

After the first suggestion is found we can continue searching for the next suggestion w’. The search terminates when P(q | w’) P(w’) < p*/100. This somewhat arbitrary criterion is used to weed out improbable suggestion alternatives.

Implementation details

The dictionary of cities consists of 2.7 million entries. City names are either transliterated into Latin characters (Coruña -> Coruna), translated to English (Москва -> Moscow), or transcribed (جدة -> Jeddah).

In the case of Google Maps it seems reasonable to assume that there are more queries for larger cities: searching for New York (pop. 8.2 million) is probably much more common than searching for New Kent (pop. 18,000). Despite exceptions to this rule (e.g. tourist destinations attracting more searches than their population would suggest), assigning the prior probability of a city suggestion in proportion to the logarithm of the city population seems a reasonable choice. Formally:
$$P(w) = \frac{\log_{10}\big(\mathrm{population}(w)\big)}{\sum_{v \in W} \log_{10}\big(\mathrm{population}(v)\big)}$$
Population information is missing for many cities in the database. In such cases the population is set to 10, resulting in a factor of 1 in the numerator.
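
A minimal sketch of this prior (the function name and the explicit normalization over the whole dictionary are mine):

#include <cmath>
#include <cstdio>

// Unnormalized prior weight of a city: log10 of its population, with a
// population of 10 (i.e. a weight of 1) substituted when the value is missing.
// P(w) is obtained by dividing the weight by the sum of weights over W.
double priorWeight(long long population) {
    if (population <= 0) population = 10;  // missing population
    return std::log10(static_cast<double>(population));
}

int main() {
    std::printf("New York (8,200,000): %.2f\n", priorWeight(8200000));
    std::printf("New Kent (18,000):    %.2f\n", priorWeight(18000));
    std::printf("unknown population:   %.2f\n", priorWeight(0));
}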

Typing errors have been extensively studied. Most typing errors are simple insertions, substitutions, and transpositions of characters. Typing errors are less likely at the beginning of the string, and transpositions usually occur at medial characters.
The general per-keystroke error rate is set to 5%, distributed over insertion errors (60%), deletion errors (16%), substitution errors (16%), and transposition errors (6%). The substitution and insertion probability mass is spread evenly among all candidate transformations (e.g. if there are 10 possible insertions, every insertion gets 1/10th of the insertion probability mass). A deletion error at the first query position is multiplicatively penalized with a factor of 0.05, and at the second position with a factor of 0.1. Insertion and substitution at the first query position are multiplicatively penalized with factors 0.05 and 0.1, respectively.
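
Spelled out in code, the error model looks roughly like the sketch below; the numbers are the ones above, while the 26-character candidate set and the function names are illustrative assumptions:

#include <cstdio>

// Per-keystroke error budget, split by error type as described above.
const double ERROR_RATE = 0.05;
const double P_INSERTION     = ERROR_RATE * 0.60;
const double P_DELETION      = ERROR_RATE * 0.16;
const double P_SUBSTITUTION  = ERROR_RATE * 0.16;
const double P_TRANSPOSITION = ERROR_RATE * 0.06;

const int CANDIDATES = 26;  // assumed number of candidate characters

// Probability of one particular insertion at query position pos (0-based):
// the insertion mass is split evenly over the candidates and penalized at
// the first position.
double insertionProb(int pos) {
    double p = P_INSERTION / CANDIDATES;
    return pos == 0 ? p * 0.05 : p;
}

// Probability of a deletion at query position pos, with the positional
// penalties described above.
double deletionProb(int pos) {
    if (pos == 0) return P_DELETION * 0.05;
    if (pos == 1) return P_DELETION * 0.10;
    return P_DELETION;
}

int main() {
    std::printf("insertion at position 0: %g, elsewhere: %g\n",
                insertionProb(0), insertionProb(2));
    std::printf("deletion at position 0:  %g, elsewhere: %g\n",
                deletionProb(0), deletionProb(2));
}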

Insertion and transposition probabilities take keyboard key distances into account, since near-key insertions and transpositions are more likely. The current keyboard distance computation is based on a South Slavic keyboard layout but can easily be adapted to other layouts.

Naive generation of all possible successor states can result in a large number of explored states. This can be alleviated by

  • fixing the order in which successor states are generated,
  • dividing the successors into three sets (a left candidate interval, the best successor node, a right candidate interval),
  • setting an interval’s probability to the probability of its best member,
  • choosing the generation order so that intervals are typically large (first evaluate insertions, then a (possible) match, then substitutions, character deletion, and transposition).

Taking into account the size of the dictionary and the ambiguity of the query, the number of expanded nodes is quite reasonable:

query      expanded nodes   suggestions
Lis Agne   2,402            Los Angeles, Los Angelitos, Les Agnels, Les Agneliers, Les Agneliers Bas
nw yr      1,098            New York, Nwoya, Nweyon, Na Yan, Nwarem
cpenh      574              Copenhagen, Cepni, Spencer, Cienega, Cuenha


Autocomplete is implemented in C++. Use it as follows:

TAutocomplete ac;
ac.load("cities.txt");                 // load the dictionary once
vector<string> suggestions;            // autocomplete suggestions
ac.autocomplete("cpenh", suggestions); // find suggestions for the input
                                       // query "cpenh"; the maximum number
                                       // of suggestions (default 5)
                                       // can be set as the last
                                       // parameter of the method

Standalone C++ sources for the autocomplete and a Mongoose web server port are available on GitHub.


Fixing online communities: A biological approach

It’s an old story, repeated many times. Evolution in the making. A new online community is born. Founding members of the community define its core values, writing style, and topics worth discussing. The community grows and after a while inevitably fragments into subgroups with markedly distinct interests and sets of values. Community posts complaining about irrelevant topics or trolling comments are the first indicator of the changes to come. It seems inevitable: despite various protective measures by the community, newcomers take over. The community is changed for good (or worse). Newcomers win and the old-timers leave for greener pastures elsewhere.

Communities employ various tactics to protect themselves from users and content considered harmful. Some require users to earn karma points before they can post, comment, or rank others. Elaborate PageRank-like reputation systems have also been devised. And if everything else fails, good old censorship can be adopted: just rename the trustworthy members of the community from censors to moderators first and grant them the right to delete content and ban users from the community.

There seems to be a tradeoff between the openness of a community and its vulnerability to vandalism. Censorship can be a very effective protective measure; however, the openness and full growth potential of the community are inevitably lost. On the other hand, a completely unprotected community ends up with tons of bad posts and troll comments. Users can still skip them, but their valuable time and loyalty to the community are wasted.

As the title of this post hints, Nature has dealt with similar problems, and evolution has produced robust and resource-efficient solutions to them. We can learn from these solutions and adapt them to our needs. Fundamentally, the problem we are dealing with is a decision problem: is a post worth publishing; is a comment civil and appropriate?

How bees do it

Imagine a bee which finds a source of food. The happy bee immediately returns to the hive and starts an elaborate dance to guide other bees to the location of the food. Does the entire hive flock in the direction indicated by the happy bee? Of course not. It would be too risky to deploy all of the hive’s resources based on the opinion of a single bee. The happy bee might err and signal the wrong direction, or the food source found might be insufficient for all the bees in the hive. Instead, only a handful of scouts fly in the direction of the food source. If the scouts perform the dance on their return, more bees head towards the target. If not, the target is abandoned. Fewer hive resources are wasted on unpromising sources of food. The decision-making process is simple, robust, and reasonably accurate.

The adoption of this principle by online communities is straightforward. In this analogy community members are bees, the community is the hive, and the happy bee is a member of the community submitting a post or a comment. A small scout group of community members is selected at random. If the majority of the scout group decides that the post is not appropriate or is against the community’s founding principles, it is discarded. Since a malicious post is not an honest error, as in the case of a happy bee, penalties for trolls are in order: the poster and all upvoters are downvoted or even denied the right to post again. If, on the other hand, the post is accepted by the scout group, the whole community votes on the submitted post.

How efficient is it

The mechanism described above allocates resources efficiently. Only the scout group spends its time on, and is exposed to, malicious content. As members of the scout group are randomly selected, there is no need for a privileged class of community censors. The filtering burden per community member is low and evenly distributed over all community members. Bee scout groups are small (15-20 bees). Unless the community is seriously infested with trolls, small scout groups should suffice.

For the math geeks out there: if there are n members in the community in total and t of them are trolls, then the number of trolls selected into a scout set of size s is a random variable X following the hypergeometric distribution:
$$P(X = k) = \frac{\binom{t}{k}\binom{n-t}{s-k}}{\binom{n}{s}}$$
To filter out malicious messages, the scout set members perform majority voting. The scout set successfully filters out a malicious message if the trolls are not in the majority (i.e. the number of trolls k ≤ s/2) and are outvoted by the other members of the scout set. Therefore, the probability that a malicious message is successfully filtered out by the scout set is:
$$P(\text{filtered}) = \sum_{k=0}^{\lfloor s/2 \rfloor} \frac{\binom{t}{k}\binom{n-t}{s-k}}{\binom{n}{s}}$$

A caveat: if immediate filtering of messages is required, only the currently online part of the community should be counted when calculating the filtering probability. In such a case n stands for the number of online members of the community and t for the number of online trolls. You need to take this into account if the online presence of your community varies over the day/week cycle; trolls may wait for off-peak hours to get a better chance of being included in the scout set and outvoting the other members in the filtering process.
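
For those who want to experiment, here is a small sketch that evaluates the sum above; the cutoff k ≤ ⌊s/2⌋ follows the text, and the function names are mine:

#include <cstdio>

// Binomial coefficient C(n, k), computed in floating point to avoid overflow.
double binomial(int n, int k) {
    if (k < 0 || k > n) return 0.0;
    double result = 1.0;
    for (int i = 1; i <= k; ++i)
        result *= static_cast<double>(n - k + i) / i;
    return result;
}

// P(X = k) for X ~ Hypergeometric(n, t, s): exactly k trolls among s scouts.
double hypergeometric(int n, int t, int s, int k) {
    return binomial(t, k) * binomial(n - t, s - k) / binomial(n, s);
}

// Probability that the trolls do not reach a majority in the scout set.
double filteringProbability(int n, int t, int s) {
    double p = 0.0;
    for (int k = 0; k <= s / 2; ++k)
        p += hypergeometric(n, t, s, k);
    return p;
}

int main() {
    // e.g. a community of 1000 members, 30% of them trolls, 20 scouts
    std::printf("%.4f\n", filteringProbability(1000, 300, 20));
}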

So let’s plug in some numbers for filtering malicious messages with 20 scouts:

community size    1000    1000    1000    5000    5000    5000
troll share        10%     20%     30%     10%     20%     30%
filtering rate    100%   99.95%  98.37%  99.99%  99.94%  98.30%

Filtering, even in the case where 30% of community members are trolls, is almost perfect. Not a shabby result for 20 bees in action.