One last iteration through my learning exercise of building a word frequency list. In this last post I’m moving away from a dict and to an ets table. I was pleasantly surprised by how easy the conversion was. For example, printing the output was just a matter of converting from dict:fold to ets:foldl. The one parity failure was that dict:update can take an initial value for when the key is missing, but neither ets:update_counter nor any other ets function I found offers this. That required writing a little wrapper function to call from lists:foldl (instead of having a multi-line inlined fun).
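
To make that parity gap concrete, here is a minimal side-by-side sketch. The module name counter_sketch and both function names are made up for illustration and are not part of the wordets module. The ets half assumes a newer OTP release (18 or later, as far as I know), where ets:update_counter/4 accepts a default object; on older releases that call is not available and a lookup-then-insert wrapper is still needed.

```erlang
-module(counter_sketch).
-export([count_with_dict/1, count_with_ets/1]).

%% dict:update/4 takes an initial value (1 here) that is stored
%% when the key is missing; otherwise the fun bumps the count.
count_with_dict(Words) ->
    lists:foldl(
      fun(W, D) -> dict:update(W, fun(C) -> C + 1 end, 1, D) end,
      dict:new(), Words).

%% Assumption: OTP 18 or later. ets:update_counter/4 inserts the
%% default object {W, 0} when the key is missing, then adds 1 to it.
count_with_ets(Words) ->
    Table = ets:new(words, []),
    lists:foreach(
      fun(W) -> ets:update_counter(Table, W, 1, {W, 0}) end,
      Words),
    Table.
```

Either function can be called with a plain list of strings, for example counter_sketch:count_with_dict(["a", "b", "a"]).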

No point in getting too deep into this - here’s the code:

-module(wordets).

-export([print_word_counts/1]).

words(String) ->
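  %% the backslashes must be doubled inside an Erlang string;
  %% re:run returns nomatch for a line with no words (e.g. a blank line)
  case re:run(String, "\\b\\w+\\b", [global,{capture,first,list}]) of
    {match, Captures} -> lists:append(Captures);
    nomatch -> []
  end.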

%% reads the next line from the file.  If there is data then...
%% split the data into a list of words and add to the word table
process_each_line(IoDevice, Table) ->
  case io:get_line(IoDevice, "") of
    eof -> 
      file:close(IoDevice),
      Table;
    {error, Reason} ->
      file:close(IoDevice),
      throw(Reason);
    Data ->
      NewTable = lists:foldl(
        fun(W, T) -> update_word_count(W, T) end, Table, words(Data)),
      process_each_line(IoDevice, NewTable)
  end.
  
update_word_count(Word, Table) ->
  case ets:lookup(Table, Word) of
    [{Word, _}] ->
      ets:update_counter(Table, Word, 1); 
    [] ->
      ets:insert(Table, {Word, 1})
  end,
  Table.

print_words(Words) ->
  ets:foldl(fun({W,C}, AccIn) -> 
    io:format("~s: ~w~n", [W, C]), AccIn end, void, Words).

%% opens the indicated file, processes the contents and prints
%% out the word/count pairs to stdout
print_word_counts(Filename) ->
  {ok, IoDevice} = file:open(Filename, [read]),
  Words = process_each_line(IoDevice, ets:new(words, [])),
  print_words(Words).

The ets implementation feels a bit forced (which it was - the point was to learn another module). I don’t think I’d have gone this way in practice unless I wanted to persist the frequency data to a file, or if the word data were more complex (for example, if I were storing information about where in the file each word was, its neighboring words, etc.).
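
For the persistence case, a rough sketch of what that could look like with ets:tab2file/2 and ets:file2tab/1 follows. The module name wordets_persist and the save/load function names are hypothetical, and this assumes the unnamed table created by ets:new(words, []) as in the code above.

```erlang
-module(wordets_persist).
-export([save/2, load/1]).

%% Dump the whole table to Filename.
%% ets:tab2file/2 returns ok or {error, Reason}.
save(Table, Filename) ->
    ets:tab2file(Table, Filename).

%% Recreate the table from Filename; the new table is owned by the
%% calling process. Returns {ok, Table} or {error, Reason}.
load(Filename) ->
    ets:file2tab(Filename).
```

After a run, wordets_persist:save(Words, "counts.tab") would write the counts out, and a later {ok, Words2} = wordets_persist:load("counts.tab") would bring them back (the file name here is just an example).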

Enough of this sample. On to something more substantial.