LJ Archive

Work the Shell

Words from Letter Sets, Part II

Dave Taylor

Issue #258, October 2015

Dave continues to build the block letter word puzzle, exploring algorithms and edge cases.

Last month, I started to explore how to figure out what words we could make from a set of letter cubes, based on a letter that a reader sent me. If you think about the problem, however, you realize that it has more general applicability in games like Scrabble and Words with Friends too: “Here's a set of letters, some possibly repeating, what words can I make with them?”

Of course, in games where there are interlocking words, there's always a spare letter or two that you're working with above and beyond what's on your rack or in your hand, but the same basic principle holds true, right? In my last article, we ascertained that the fast way to identify possible word candidates for a set of letters was through a rather tricky regular expression:

^[word]{length}$

You'll doubtless recall that ^ is the special character that matches the very beginning of the line, and $ is the one that matches the last character. In a regular expression (or “regex” as we old-timers call them), anything with a {} notation is a repeat count, so what we're basically looking for here are words that are composed exclusively of the letters in the [] notated set that are {} characters long.

In practice, ^[chicken]{7}$ matches the word chicken (of course), along with any other seven-letter words that can be spelled by using just those letters—which turns out to be “chicken”. So, that wasn't a great demo. But, let's say we remove the trailing character, allowing matching words that include all the letters but are longer than just seven characters. Now the results are a smidgen more interesting:

checking
chicken
chickens
chinning

You can see that they don't work simply by length, but the first has a “g” that we don't have in the original set, the third an “s”, and the last has one “n” more than the original pattern included.

That's why this is a two-step solution!

Let's capture this first portion in code before we proceed, however, keeping in mind that we can simplify the expression by removing duplicate characters (so “chicken” can become “chiken”). Here's my code snippet:


length="$(echo "$1" | wc -c)"
length=$(( $length - 1 ))

unique="$(echo $1 | sed 's/./&\
/g' | tr '[[:upper:]]' '[[:lower:]]' | sort | \
uniq | fmt | tr -C -d '[[:alpha:]]')"

regex="^[$unique]{$length}"

echo "Raw word list of matches for letterset $unique:"

grep -E $regex "$dictionary"

As you immediately can see, the real work of the script is in removing duplicate letters from the word pattern—a rather crazy-complicated pipe that works like this:

echo user specified pattern |
  append a carriage return to each individual letter |
  translate uppercase to lowercase |
  sort the letters alphabetically |
  remove duplicates with "uniq" |
  merge the multiple lines to a single line with "fmt" |
  remove any non-alphabetic characters with "tr"

That's a lot of work, but it's a tricky task. We need to explode the word so it's all separate letters, process those letters, then reassemble the word afterward.

In this case, running this with our word of “chicken” (but with the trailing $ still missing) produces:

Raw word list of matches for letterset cehikn:
checking
chicken
chickens
chinning

Notice the alphabetically sorted, dupe-removed “cehikn” in the prompt.

So we don't forget, let's go back into the code and change the pattern assembly line to include the $:

regex="^[$unique]{$length}$"

But first....

Back to Those Letter Blocks

The thing about everything we've written so far is that it assumes we want to use every single letter every time we run the program, and that might not be a valid assumption. For example, perhaps a four-letter word out of five blocks is an acceptable solution, with the spare block put aside for that particular day.

To match that scenario, a single comma will do the work as long as the length parameter in the regular expression indicates the minimum acceptable length. In other words, instead of something like [chicken]{7}, use a pattern like this: [chicken]{5,}.

The good news? This provides a lot more matches:

Raw word list of 5 or longer matches for letterset cehikn: 
check
cheek
chick
chicken
chink
hence
niche
niece

What I did in the code was change the length calculation always to calculate two less than the actual length—really the only change needed:

minlength=$(( $length - 3 )) 

Well, I guess I tweaked the echo statement too:

echo "Raw word list of $minlength or longer matches \
   for letterset $unique:"

Still, that's not much of a change and a big boost to the number of possible words that match the given set of characters!

But which actually can be assembled without needing duplicates of letters that aren't available?

Making Sure We Use Only the Available Letters

My first thought on solving this problem involved sorting the letters in the words and comparing, but that works only if the two words are the same length. Consider the pattern “cats” and the word “cast”. Sort them by letter and they're the same: “acst”. Easy.

But can you spell “chick” with the letters in “chicken”? Can you spell “cheek”? Yes, and no.

The smarter way to manage the process is through a “picking” algorithm where the program goes through each letter of the input pattern (for example, the letters on the blocks) and checks against the potential match. Each time a letter is matched, it's removed. This then makes it easy to identify the situation where there are letters that are used too often, as the double “e” of “cheek” above demonstrates.

If there are letters that aren't axed in the potential word by the time all the letters of the pattern are exhausted, that's a fail too (representing a letter that doesn't appear in the input set), but that already should be caught earlier in the program when we are screening against our letter set.

In other words, the word “cheek” can't be made out of “chicken”, but neither can “chicky”: there's no “y”. But that situation will have been identified in the first half of the program anyway, and the word would never have made it into the test list at this point.

Still, a bit of analysis shows that it's not really a big deal because the algorithm remains the same: start with the test word and the pattern. For each letter in the pattern, remove that letter from the test word if present. When done, the test word should be consumed entirely by the substitution process and no letters should remain standing.

If it meets the minimum length requirement (because “hi” shouldn't be a good match for the letter set “chicken”) and has no remaining letters, it's a win! We've found another word that can be made from the letters of the pattern.

And coding it? That's next month. Stay tuned.

Dave Taylor has been hacking shell scripts since the dawn of the computer era. Well, not really, but still, 30 years is a long time! He's the author of the popular Wicked Cool Shell Scripts (10th anniversary update coming very soon from O'Reilly and NoStarch Press) and can be found on Twitter as @DaveTaylor and more generally at his tech site www.AskDaveTaylor.com.

LJ Archive