A few months ago when Letterpress was hot, I thought about an efficient algorithm for a cheater application. A trivial solution in Ruby is quite slow for this problem, which motivated me to check out alternative approaches. Abbreviated, the problem is the following:
Given 25 random letters (
letters), find every string in an array of strings (
words) that consists of only those letters.
I’ll start by walking through a naive solution before presenting a data structure to solve this problem efficiently.
The naive approach for this problem is to simply loop through all elements of
words and check whether these words can be formed from the characters in
letters with a frequency map:
A frequency map looks like this:
For an average word length
n words in
words this runs in O(n m)
time. In my tests, this algorithm runs in about 2-4s on my Macbook on the
/usr/share/dict/words. That is not fast enough if this was to be
used for a web application for example, so we dig deeper.
A simple iteration on this naive algorithm, is the observation that if a
character is not in
letters, then we do not have to loop through all the words
that start with this letter in
words. E.g. if there’s no letter c in
letters, we can skip to the words beginning with d when we encounter the first
word which begins with c in
words and so on.
The speed of this iteration depends on how many words in
letters start with
each character in
words. Say that
k is the highest number of words that
starts with any character in
d is the number of distinct
characters from which we can form words, then this runs in worst-case O(d k)
time. In my tests, this was about twice as fast as the previous algorithm.
Another iteration is threading the processing of each group.
Using a Trie
The idea of grouping is similar to the original solution I had in mind, but instead of just grouping on the first character, group on all characters! This creates a neat, recursive structure called a Trie.
For instance, if we put in the word “band”, “ban” and “boo” in the data structure, it will look like this:
That way we can check that “boo” is in the structure with something like:
map[:b][:o][:o]. However, that would also imply that “bo” is in the structure,
which it is not. We need a state on each of the maps that tells whether a word
ends at this letter.
With that in place, we can create a method to create the data structure described above, by going through each character in the added string, creating new Tries as we go:
With that comes the interesting part. The problem is now, given this data
structure, how do we find all the entries in the data structure, which is the
end of a word that we can get to with
Once again we make use of the frequency map explained in the previous section, and then we recursively visit nodes in the data structure. The frequency map is updated as we go down the recursion, so invalid paths can be detected.
The full implementation of the data structure can be seen in this Gist.
I made a few small optimizations to the trie showed here to make it faster. There’s still plenty of things that could be done to make it faster, such as concurrent searching, and there’s probably a bunch of things that can be found profiling. Either way, it’s clear that it is much faster to use a trie over the naive methods presented earlier. The trie uses a lot of memory, this could be optimized by converting it into a radix tree, which could also yield small performance benefits.
user system total real Benchmarking 'ovrkqlwislrecrtgmvpfprzey' faster trie 0.090000 0.000000 0.090000 ( 0.089411) blog trie 0.390000 0.040000 0.430000 ( 0.452724) naive 3.290000 0.140000 3.430000 ( 3.467491) group 2.230000 0.050000 2.280000 ( 2.290674) Benchmarking 'abcdefghifghijklmnopqrstuvxyz' faster trie 0.670000 0.000000 0.670000 ( 0.671967) blog trie 3.020000 0.070000 3.090000 ( 3.094881) naive 4.450000 0.050000 4.500000 ( 4.596610) group 4.120000 0.050000 4.170000 ( 4.208492) Benchmarking 'odidwocswkbafvydehsbiviez' faster trie 0.030000 0.000000 0.030000 ( 0.030700) blog trie 0.100000 0.010000 0.110000 ( 0.103797) naive 2.550000 0.010000 2.560000 ( 2.566604) group 1.640000 0.010000 1.650000 ( 1.658194) Benchmarking 'rtlyifebuzkxndovzyzodelap' faster trie 0.150000 0.000000 0.150000 ( 0.154702) blog trie 0.630000 0.010000 0.640000 ( 0.633183) naive 2.810000 0.010000 2.820000 ( 2.822617) group 2.140000 0.000000 2.140000 ( 2.147578)
About 0.1-0.2s in the worst case on actual Letterpress games is absolutely fine for use in a service. The second case, which takes the most time, is a stress test with every letter of the alphabet. It would never happen in Letterpress.