Finding word frequencies using Erlang
Here’s a follow-up post that provides a better implementation.
I wanted to bite off something a little larger today, hitting a few areas that seem generally useful:
- Basic text file operations
- Basic string operations
- Using the gb_trees module
- Avoiding any usage of lists:foreach; instead using tail recursion (that whole “thinking in Erlang” thing).
The problem is to produce the word frequency of a specific text file and to print out the frequency information. For example, if a file overdunn.txt contained the text:
“Dunn was over Unger and I was over Dunn.” (Capt. Oveur, Airplane II: The Sequel)
The output would be:
Dunn: 2
I: 1
Unger: 1
and: 1
over: 2
was: 2
Word frequencies have some practical uses (fuzzy text matching, building tag clouds, etc.). So while it’s a bit contrived, it is the basis for something useful.
The code doesn’t need a lot of explanation – so here you go …
-module(wordlist).
-export([print_word_counts/1]).

%% matches/* and words/1 From: http://www.trapexit.org/Matching_Words
matches(H, {match, M}) ->
    matches(H, M, []).

matches(_, [], Acc) ->
    Acc;
matches(H, [{I,L}|T], Acc) ->
    matches(H, T, [lists:sublist(H,I,L)|Acc]).

words(String) ->
    matches(String, regexp:matches(String, "[A-Za-z0-9]+")).

%% builds a tree of word/count pairs. If the word does not exist in
%% the tree it is added with an initial value of 1. If the word does
%% exist the count is retrieved and incremented
build_word_tree([], Tree) ->
    Tree;
build_word_tree([W|R], Tree) ->
    case gb_trees:is_defined(W, Tree) of
        true ->
            Count = gb_trees:get(W, Tree),
            NewTree = gb_trees:update(W, Count + 1, Tree),
            build_word_tree(R, NewTree);
        false ->
            NewTree = gb_trees:insert(W, 1, Tree),
            build_word_tree(R, NewTree)
    end.

%% reads the next line from the file. If there is data then...
%% split the data into a list of words and add those to the word tree
process_each_line(IoDevice, Tree) ->
    case io:get_line(IoDevice, "") of
        eof ->
            file:close(IoDevice),
            Tree;
        {error, Reason} ->
            file:close(IoDevice),
            throw(Reason);
        Data ->
            NewTree = build_word_tree(words(Data), Tree),
            process_each_line(IoDevice, NewTree)
    end.

%% walks the gb_tree and prints each word/count pair
print_tree(Iter) ->
    case gb_trees:next(Iter) of
        none ->
            ok;
        {Key, Val, NewIter} ->
            io:format("~s: ~w~n", [Key,Val]),
            print_tree(NewIter)
    end.

%% opens the indicated file, processes the contents and prints
%% out the word/count pairs to stdout
print_word_counts(Filename) ->
    case file:open(Filename, read) of
        {ok, IoDevice} ->
            Tree = process_each_line(IoDevice, gb_trees:empty()),
            print_tree(gb_trees:iterator(Tree));
        {error, Reason} ->
            io:format("~s~n", [Reason])
    end.
As usual – I’m just getting started with Erlang. What is the good, bad and ugly with this code?
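If you want to poke at just the counting step without a file or a regular expression, here is a minimal sketch that takes a ready-made list of words. The module name tree_count and the function count/1 are made up for illustration, and it uses gb_trees:lookup/2 in place of the is_defined/get pair from the post, which is a small variation rather than what the original code does.

```erlang
-module(tree_count).
-export([count/1]).

%% Counts each word in a list using a gb_tree, then returns the
%% {Word, Count} pairs in key order (gb_trees keeps keys sorted).
count(Words) ->
    gb_trees:to_list(count(Words, gb_trees:empty())).

%% Tail-recursive helper: the tree built so far is the accumulator.
count([], Tree) ->
    Tree;
count([W | Rest], Tree) ->
    NewTree = case gb_trees:lookup(W, Tree) of
                  {value, N} -> gb_trees:update(W, N + 1, Tree);
                  none       -> gb_trees:insert(W, 1, Tree)
              end,
    count(Rest, NewTree).
```

Calling tree_count:count/1 with the words of the Airplane II quote should give the same pairs, in the same order, as the sample output above.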

While there’s nothing “wrong” with your code, per se, a suggestion – Erlang now has PCRE-style regular expressions, and the ‘regexp’ module is being deprecated.
So you could capture all of the words in the string with something like this:
words(String) ->
    {match, Captures} = re:run(String, "\\b\\w+\\b", [global,{capture,first,list}]),
    [hd(C) || C <- Captures]. % re returns the capture as a nested list
Then I’d probably use some kind of key-value structure, like a proplist or dict, to count the words using a lists:foldX function.
@Sean
Thanks! The re module certainly is an improvement.
I need to think more about lists:foldX.
lists:foldl(fun(W, Dict) -> dict:update(W, fun(C) -> C + 1 end, 1, Dict) end, dict:new(), ["foo", "foo", "bar"]).
Edit: I said the above didn’t work but in fact it did. I was looking at the output in erl and forgetting that the output is a dict, not a list.
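For anyone who trips over the same thing: a dict prints as an opaque internal structure in the shell, so it helps to convert it before looking at it. A small sketch, where the module name dict_count and the function count/1 are hypothetical names chosen for this example:

```erlang
-module(dict_count).
-export([count/1]).

%% Folds over the words, bumping each word's counter in a dict
%% (starting at 1 for a new word), then converts the dict into a
%% sorted list of {Word, Count} pairs that is easy to read in erl.
count(Words) ->
    Dict = lists:foldl(
             fun(W, D) -> dict:update(W, fun(C) -> C + 1 end, 1, D) end,
             dict:new(),
             Words),
    lists:sort(dict:to_list(Dict)).
```

The lists:sort/1 call is only there because dict makes no promise about the order of dict:to_list/1.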
You might want to do this for the words function, as it doesn’t handle the empty string:
words(String) ->
    case re:run(String, "\\b\\w+\\b", [global,{capture,first,list}]) of
        {match, Captures} ->
            [hd(C) || C <- Captures];
        nomatch ->
            []
    end.
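Putting the suggestions in this thread together, a string-to-counts pipeline might look roughly like the sketch below. The module name wordfreq and the function frequencies/1 are invented for illustration, and dict:update_counter/3 is swapped in for the dict:update/4 call shown earlier; treat it as one possible arrangement, not the follow-up post's implementation.

```erlang
-module(wordfreq).
-export([words/1, frequencies/1]).

%% Splits a string into words with the re module. re:run/3 returns
%% nomatch when nothing matches (for example on ""), so that case
%% maps to an empty list instead of crashing on a failed match.
words(String) ->
    case re:run(String, "\\b\\w+\\b", [global, {capture, first, list}]) of
        {match, Captures} -> [hd(C) || C <- Captures];
        nomatch           -> []
    end.

%% Counts the words of a string and returns sorted {Word, Count} pairs.
frequencies(String) ->
    Dict = lists:foldl(fun(W, D) -> dict:update_counter(W, 1, D) end,
                       dict:new(),
                       words(String)),
    lists:sort(dict:to_list(Dict)).
```

Note that \w also matches digits and underscores, so "word" here is a little looser than letters only.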