Most Linux systems come with a dictionary file: a huge list of words in your language. (On a minimal install, you may need to add it from your distribution's word-list package.) This “words” file is useful for spell-checking, but you can put it to use in many different ways. Learn more about Linux commands with some wordy diversions.
1 Find Words With a Certain Length
The dictionary file—usually located at /usr/share/dict/words—is particularly useful for fans of word games and similar puzzles. Imagine you’re looking for a word to complete a crossword puzzle, or one to fit in a particular space in a design. You might want a list of words that are exactly a specific length. With the words file and some common Linux utilities like grep or awk, this is an easy problem to solve.
You’re probably more familiar with grep than awk, so let’s start with that tool. Here’s how to use a regular expression to get all the words of a certain length:
egrep '^.{22}$' /usr/share/dict/words
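egrep is shorthand for “grep -E,” which uses extended regular expressions. In this case, it allows for cleaner syntax. Recent versions of GNU grep warn that egrep is obsolescent; if you see that warning, type “grep -E” instead.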
This example is pretty straightforward once you’re familiar with regular expressions:
- The anchor characters—circumflex (^) and dollar ($)—represent the beginning and end of the line. They ensure matches that are complete words, not just part of a word.
- The period (.) is a wildcard that matches any single character. Finally, the {22} repeats the previous item 22 times. So the full expression gives us words exactly 22 characters long.
awk is another tool that makes heavy use of regular expression patterns, but it’s much more powerful than grep—a full-blown language, in fact. This means it has various shortcuts that are invaluable in cases such as this.
Here’s the awk equivalent of the previous grep command:
awk 'length($0) == 22' /usr/share/dict/words
This uses two awk features: $0, which holds the entire current line, and the built-in length function, which counts the number of characters in it. Lines for which the comparison is true are printed.
You can use either of these readily available tools for the job. Your choice will depend on factors like efficiency, ease-of-use, and personal preference. If you’re familiar with regular expressions, the grep approach may be easier, but the awk command is a bit more readable.
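If you look up different lengths often, you could wrap the awk one-liner in a small shell function. This is only a sketch: the name words_of_length and its optional file argument are made up for illustration, not part of any standard tool.

```shell
# Hypothetical helper: print every word that is exactly N characters long.
# Usage: words_of_length N [FILE]   (FILE defaults to the system dictionary)
words_of_length() {
    # -v passes the shell argument into awk as the variable n
    awk -v n="$1" 'length($0) == n' "${2:-/usr/share/dict/words}"
}
```

With the function defined, “words_of_length 22” should behave like the commands above.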
2 Discover the Longest Word
Building on the previous example, how about finding out what the longest word is? You can do so very easily with the wc (word count) program and its -L flag:
wc -L /usr/share/dict/words
On my Mac, the output says the longest word is 28 letters long, which is pretty impressive; your number may differ. To find out what this word actually is, reuse the process from the word length example, substituting the length you got:
awk 'length($0) == 28' /usr/share/dict/words
Dictionary files vary greatly. Of course, different languages will have totally different dictionaries but, between systems, even the same language can have a very different set of words. For example, macOS tells me that “antidisestablishmentarianism” is the longest word, while a remote Ubuntu system I log in to tells me it’s “electroencephalograph’s.” Maybe macOS is just that little bit more “book smart.”
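The two steps (measure, then search) can also be folded into a single awk pass, which avoids copying the length by hand. Treat this as one possible sketch; longest_words is a hypothetical name, and the function prints every word that ties for the longest.

```shell
# Hypothetical helper: print the longest line(s) of FILE in one pass.
# Usage: longest_words [FILE]   (FILE defaults to the system dictionary)
longest_words() {
    awk '
        BEGIN             { max = -1 }
        length($0) > max  { max = length($0); words = $0; next }  # new record: start over
        length($0) == max { words = words "\n" $0 }               # tie: keep this one too
        END               { if (NR > 0) print words }
    ' "${1:-/usr/share/dict/words}"
}
```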
3 Get Inspiration for Naming Things
The dictionary file has a surprising number of proper nouns, but we can use this to our advantage. Maybe you need a character name for that novel you’re working on or you’re looking for unorthodox suggestions for the name of a newborn.
For whatever reason, if you want a list of possible names, the words file has you covered. Simply search for every line beginning with a capital letter:
grep '^[A-Z].' /usr/share/dict/words
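Depending on your dictionary, that list may also include possessives (entries ending in ’s) and all-caps acronyms. If you only want name-like entries, a stricter pattern can help. The sketch below assumes a “name” is one capital letter followed by lowercase ASCII letters, which is a simplification; name_candidates is a made-up function name.

```shell
# Hypothetical helper: capitalized words made of one capital plus lowercase letters.
# LC_ALL=C makes the ranges plain ASCII, so accented names are skipped as well.
name_candidates() {
    LC_ALL=C grep -E '^[A-Z][a-z]+$' "${1:-/usr/share/dict/words}"
}
```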
4 Get a Random Word
Getting a random line from a file sounds useful, but there’s no obvious built-in command to do so. However, this demonstrates the power of the Linux pipeline: you can chain a couple of simple commands together to do the job.
First of all, you need to know about the sort command and its -R flag. sort usually orders a set of lines alphabetically or numerically, but the -R flag randomizes the order instead. This simplifies the task: randomize the words and pick the first, which is a job for head:
sort -R /usr/share/dict/words | head -n1
Be sure to use -R, not -r, which sorts in reverse order.
This may not be the most efficient solution, but it’s fast enough on a typical dictionary file and a modern computer—and it’s easy to remember!
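If reordering every line just to keep one bothers you, awk can pick a random line in a single pass. This is a sketch of a different technique (reservoir sampling with a reservoir of one), not what sort -R does internally; random_word is a hypothetical name.

```shell
# Hypothetical helper: pick one line uniformly at random in a single pass.
# Line number NR replaces the current pick with probability 1/NR.
# Usage: random_word [FILE]   (FILE defaults to the system dictionary)
random_word() {
    awk 'BEGIN { srand() }
         rand() * NR < 1 { pick = $0 }
         END { if (NR > 0) print pick }' "${1:-/usr/share/dict/words}"
}
```

One caveat: on many awk implementations srand() seeds from the clock in whole seconds, so two calls within the same second can return the same word.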
5 Find Words Without Any Vowels
Curious about the English language? You may have heard the old adage that every word contains a vowel. Let’s disprove that with some simple grep’ing:
egrep '^([^aeiou]){2,}$' /usr/share/dict/words
This expression uses the character class syntax—square brackets ([ and ])—to restrict characters to a given set. It uses negation—the second circumflex (^)—to match any characters that are not in the set. So each character must not be “a,” “e,” etc. The {2,} restricts matches to words with at least two letters; single-letter results aren’t very interesting!
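One thing to watch: the character class above only lists lowercase vowels, so a capitalized entry such as “Ill” would still count as vowel-free. A possible variant that excludes both cases is sketched below; vowelless_words is a made-up name, and whether “y” should count as a vowel is left up to you.

```shell
# Hypothetical helper: words of two or more characters with no vowel in either case.
# Usage: vowelless_words [FILE]   (FILE defaults to the system dictionary)
vowelless_words() {
    grep -E '^[^aeiouAEIOU]{2,}$' "${1:-/usr/share/dict/words}"
}
```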
6 Find Words With All the Vowels in Order
We word nerds love our trivia, and a common challenge involves finding a word that contains all vowels in order: a, e, i, o, and u. Here’s a quick grep command that looks for words containing the five vowels, with any number of characters in between:
grep '.*a.*e.*i.*o.*u.*' /usr/share/dict/words
You may notice a slight flaw with this regex: it returns words like “abietineous” which repeat vowels. If you want to be very strict, you can modify the regex, although it gets a bit messy:
grep '[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*' /usr/share/dict/words
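This explicitly rules out vowels between the ones we’re searching for. Because the pattern isn’t anchored, extra vowels can still appear before the “a” or after the “u”; add ^ at the start and $ at the end if you want each vowel to appear exactly once. It gives some more satisfying results, including words like “abstemious” and “facetious.”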
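To tame the messiness, you can store the repeated “run of non-vowels” piece in a shell variable and build the pattern from it. The sketch below also anchors the pattern so the whole word has to fit, meaning each vowel appears exactly once; the names vowels_in_order and nv are invented for this example.

```shell
# Hypothetical helper: words containing a, e, i, o, u once each, in that order.
# Usage: vowels_in_order [FILE]   (FILE defaults to the system dictionary)
vowels_in_order() {
    nv='[^aeiouAEIOU]*'   # any run of non-vowel characters, possibly empty
    # ^ and \$ anchor the pattern to the whole line
    grep -E "^${nv}a${nv}e${nv}i${nv}o${nv}u${nv}\$" "${1:-/usr/share/dict/words}"
}
```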
7 Master Back-References for Smarter Search
Back to grep, and this time we’ll see how to search for a complicated pattern like the same letter twice. This requires the use of back-references which help you search for something that has already been matched. You can find words with a double-letter sequence like this:
egrep "(.)\1" /usr/share/dict/words
Here, the parentheses create a grouped expression which you can then match using a back-reference. The back-reference, \1, matches the first (and only) group in the regex. So this regular expression means “any character followed by the same character.”
There are many words in the English language with two identical letters in a row. But what about three?
egrep "(.)\1\1" /usr/share/dict/words
On macOS, my dictionary contains seven interesting examples.
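A portability note: as far as I know, POSIX only defines back-references for basic regular expressions, and their support under “grep -E” is an extension that GNU and BSD grep happen to provide. If you ever hit a grep that refuses the commands above, the basic-syntax form sketched here should do the same job; triple_letters is a hypothetical function name.

```shell
# Hypothetical helper: words with the same character three times in a row.
# Basic regex syntax: the group is written \( \) and referenced with \1.
triple_letters() {
    grep '\(.\)\1\1' "${1:-/usr/share/dict/words}"
}
```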
8 Discover Which Letters Are Used the Most
OK, this is a bit of a cheat since we’ll actually discover which letter appears at the beginning of most words, but this is still interesting information. It’s also another useful demonstration of a pipeline, this time using four separate utilities:
cut -b1 /usr/share/dict/words | tr '[:upper:]' '[:lower:]' | uniq -c | sort -n
“cut -b1” returns the first letter from each line in the file. “tr '[:upper:]' '[:lower:]'” converts everything to lowercase. “uniq -c” gives us unique lines, combining adjacent identical lines and counting them as it goes. Finally, “sort -n” sorts the results numerically so the most common leading letter appears at the bottom.
Your results will vary according to your language. In English, it looks like “s” starts the most words, narrowly beating out “p.”
If your dictionary orders all uppercase words first, followed by all lowercase words, add a “| sort” just before the “| uniq.”
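Putting that together, here is a sketch of the pipeline with the extra sort always included, so it works whichever way your dictionary is ordered. It also sorts in reverse so the most common letter comes first. The function name leading_letter_counts is made up, and “cut -c1” asks for the first character rather than the first byte, which only matters for non-ASCII words (and not every cut implementation treats the two differently).

```shell
# Hypothetical helper: count how many words start with each letter, most common first.
# Usage: leading_letter_counts [FILE]   (FILE defaults to the system dictionary)
leading_letter_counts() {
    cut -c1 "${1:-/usr/share/dict/words}" |   # first character of every line
        tr '[:upper:]' '[:lower:]' |          # fold case so "A" and "a" count together
        sort |                                # uniq only merges adjacent lines
        uniq -c |                             # count each distinct letter
        sort -rn                              # biggest count at the top
}
```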
These are just some of the fun and interesting things you can do in your Linux terminal.