subscribe

C# developer by day, Ruby and budding Erlang enthusiast by night. Father, husband and all around good guy.

July 2nd, 2009

Sean Cribbs was nice enough to point out a pair of changes I could make to my word frequency counter from last time.

Based on his feedback I made three changes. First – the regular expression code has changed from this:

matches(H,{match,M}) -> matches(H,M,[]).
matches(_,[],Acc) -> Acc;
matches(H,[{I,L}|T],Acc) ->
    matches(H,T,[lists:sublist(H,I,L)|Acc]).
 
words(String) -> matches(String,regexp:matches(String, "[A-Za-z0-1]+")).

to this:

words(String) ->
  {match, Captures} = re:run(String, "\\b\\w+\\b", [global,{capture,first,list}]),
  [hd(C) || C<-Captures].

That last line took me a bit to grok. It’s a list comprehension (if you are reading Joe Armstrong’s thesis it is section 3.3.13. In Erlang Programming it is chapter 9.3). Basically it’s saying “for each list in the list of matches take the head of the list” – a-gigga-wah?

Ok. Let’s go to erl.
7> re:run("foo foo bar", "\\b\\w+\\b", [global,{capture,first,list}]).
{match,[["foo"],["foo"],["bar"]]}

Observe that re:run returns a nested list (i.e. a list of lists) – and each list has exactly one element (the string [which is itself a list but I'll cal them strings]). What we want to do is take that list-of-lists-of-strings and turn it into a list-of-strings.

That’s what “[hd(C) || C<-Captures].” does – it pulls every capture (a word wrapped in a list) from the match list and runs it through erlang:hd which pulls the word from the list – then it gets added to the resulting list. So we end up with a list strings.

It’s un-nesting the list.

Next Sean suggested “Then I’d probably use some kind of key-value structure, like a proplist or dict, to count the words using a lists:foldX function.”

so I fired up “erl -man lists” to learn what foldX meant (actually “foldl” “foldr” depending on whether you want to fold from the left or right.

In a nut shell folding is iterates over a list calling a fun that takes the current value and an accumulator and which returns the new accumulator. An example from the man page is:


> lists:foldl(fun(X, Sum) -> X + Sum end, 0, [1,2,3,4,5]).
15

I spent some time thinking and after some trial and error came up with this:

lists:foldl(fun(W, Dict) -> 
    dict:update(W, fun(C) -> C + 1 end, 1, Dict) end, dict:new(), 
    ["foo", "foo", "bar"]).  %% sample input

In a nutshell – for every word in the list update the dictionary by calling the fun which increments the count value, setting the initial count to 1 if the value does not already exist in the dictionary (and starting with an empty dictionary).

After these three changes the new program is about half the size of the previous and really only has a few interesting lines surround by nearly error and flow control.

Thanks Sean!

The new code …

-module(wordlist).
 
-export([print_word_counts/1]).
 
words(String) ->
  {match, Captures} = re:run(String, "\\b\\w+\\b", [global,{capture,first,list}]),
  [hd(C) || C<-Captures].
 
%% reads the next line from the file.  If there is data then...
%% split the data into a list of words and add those to the word dict
process_each_line(IoDevice, Dict) ->
    case io:get_line(IoDevice, "") of
        eof -> 
            file:close(IoDevice),
            Dict;
        {error, Reason} ->
            file:close(IoDevice),
            throw(Reason);
        Data ->
            NewDict = lists:foldl(
                        fun(W, D) -> dict:update(W, fun(C) -> C + 1 end, 1, D) end, 
                        dict:new(), 
                        words(Data)),
            process_each_line(IoDevice, NewDict)
    end.
 
print_dict(Dict) ->
    dict:fold(fun(Word, Count, AccIn) -> 
        io:format("~s: ~w~n", [Word, Count]), AccIn end, void, Dict).
 
%% opens the indicated file, processes the contents and prints
%% out the word/count pairs to stdout
print_word_counts(Filename) ->
    case file:open(Filename, read) of
        {ok, IoDevice} ->
            Dict = process_each_line(IoDevice, dict:new()),
            print_dict(Dict);
    end.
Share and Enjoy:
  • Print this article!
  • Twitter
  • Digg
  • Reddit
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • StumbleUpon
2
Comments

Tags:

July 2nd, 2009

Here’s a follow-up post that provides a better implementation.

I wanted to try and bite off something a little larger today hitting a few areas that seem generally useful:

  1. Basic text file operations
  2. Basic string operations
  3. Using the gb_trees module
  4. Avoiding any usage of lists:foreach; instead using tail recursion (that whole “thinking in Erlang” thing).

The problem is to produce the word frequency of a specific text file and to print out the frequency information.  For example if a file overdunn.txt contained the text:

“Dunn was over Unger and I was over Dunn.” (Capt. Oveur, Airplane II: The Sequel)

The output would be:

Dunn: 2
I: 1
Unger: 1
and: 1
over: 2
was: 2

Word frequencies have some practical uses (fuzzy text matching, building tag clouds, etc).  So while it’s a bit contrived it is the basis for something useful.

The code doesn’t need a lot of explanation – so here you go …

-module(wordlist).
 
-export([print_word_counts/1]).
 
%% matches/* and words/1 From: http://www.trapexit.org/Matching_Words
matches(H,{match,M}) -> matches(H,M,[]).
matches(_,[],Acc) -> Acc;
matches(H,[{I,L}|T],Acc) ->
    matches(H,T,[lists:sublist(H,I,L)|Acc]).
 
words(String) -> matches(String,regexp:matches(String, "[A-Za-z0-1]+")).
 
 
%% builds a tree of word/count pairs.  If the word does not exist in 
%% the tree it is added with an initial value of 1.  If the word does
%% exist the count is retrieved and incremented
build_word_tree([], Tree) -> Tree;
build_word_tree([W|R], Tree) ->
	case gb_trees:is_defined(W, Tree) of
		true ->
			Count = gb_trees:get(W, Tree),
			NewTree = gb_trees:update(W, Count + 1, Tree),
			build_word_tree(R, NewTree);
		false ->
			NewTree = gb_trees:insert(W, 1, Tree),
			build_word_tree(R, NewTree)
	end.
 
%% reads the next line from the file.  If there is data then...
%% split the data into a list of words and add those to the word tree
process_each_line(IoDevice, Tree) ->
	case io:get_line(IoDevice, "") of
		eof -> 
			file:close(IoDevice),
			Tree;
		{error, Reason} ->
			file:close(IoDevice),
			throw(Reason);
		Data ->
			NewTree = build_word_tree(words(Data), Tree),
			process_each_line(IoDevice, NewTree)
	end.
 
%% walks the gb_tree and prints each word/count pair
print_tree(Iter) ->
	case gb_trees:next(Iter) of
		none -> ok;
		{Key, Val, NewIter} ->
			io:format("~s: ~w~n", [Key,Val]),
			print_tree(NewIter)
	end.
 
%% opens the indicated file, processes the contents and prints
%% out the word/count pairs to stdout
print_word_counts(Filename) ->
	case file:open(Filename, read) of
		{ok, IoDevice} ->
			Tree = process_each_line(IoDevice, gb_trees:empty()),
			print_tree(gb_trees:iterator(Tree));
		{error, Reason} ->
			io:format("~s~n", [Reason])
	end.

As usual – I’m just getting started with Erlang.  What is the good, bad and ugly with this code?

Share and Enjoy:
  • Print this article!
  • Twitter
  • Digg
  • Reddit
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • StumbleUpon
4
Comments

Tags:

July 1st, 2009

The first episode of The X-Files I ever saw was the Home.  I was hooked immediately – and at the same time somewhat disappointed.  This was before on-demand TV, hulu or DVD compilations on NetFlix.  I had missed more than 3 seasons of a what was clearly going to be one of my favorite shows and there the only way to catch up was the hope that the next rerun would be one I hadn’t seen.  Then it got all weird and aliens were showing up too often, Mulder basically disappeared and The Lone Gunman got their own spinoff …

Learning Erlang is a little like seeing the first episode.  I’m hooked immediately – and at the same time somewhat disappointed.  I’m years behind again.  At least this time I don’t need to fret over not being able to find the reruns I want and there is very little risk that Dean Haglund will make an appearance (I bet Langly knew Erlang).

A few minutes ago I had my first significant Aha! moment.  In my effort to map what I do know (procedural/OO paradigms) to Erlang I wondered “how would a balance tree be implemented in Erlang”.  I binged (yeah, I said it) for “Erlang balanced trees” and quickly found the gb_trees module.

What I wanted to understand was:

  1. How is the tree represented in Erlang?
  2. How is the tree (and it’s nodes) modified (insert, update, delete, etc)?
  3. Can processes share the same tree (and how is concurrency handled)?

How is the tree represented in Erlang?

This took me the longest to understand – but once it fell into place everything else made so much more sense.  If you are curious go ahead and open up gb_trees.erl (on my machine it’s in /usr/lib/erlang/lib/stdlib-1.15.5/src) – now go to the definition of the node and tree types.  Go ahead… I’ll wait.

Found them yet?  No?  Maybe it’s in some header file.  Feel free to check … in the meantime I’ll continue.

Ready for this?  There isn’t one.  You don’t need to define a tree or node type – it’s just a tuple that gets passed around and interpreted at runtime.  Not quite clear?  OK – if we were going to define it, it would look something like this:

{Size, {Key, Value, Smaller, Bigger}}

  • Size is the number of nodes in the tree (which is used for balancing).
  • Key is the key of the current node.
  • Value is the value associated with Key.
  • Smaller is the nodes that are smaller (i.e. “to the left”).
  • Bigger is nodes that are bigger (”to the right”).

These functions really helped make it much clearer:

%%-spec(empty/0 :: () -> gb_tree()).
 
empty() ->
    {0, nil}.
 
%%-spec(is_empty/1 :: (gb_tree()) -> bool()).
 
is_empty({0, nil}) ->
    true;
is_empty(_) ->
    false.
 
%%-spec(size/1 :: (gb_tree()) -> non_neg_integer()).
 
size({Size, _}) when is_integer(Size), Size >= 0 ->
    Size.

“size” says so much it’s wonderful.  Besides the obvious of getting the size of the tree it also validates that the size is greater-than-or-equal-to zero otherwise there will be an exception.  This is a great example of Erlang failing fast.  Try it out:

2> gb_trees:size({-1, {[]}}).
** exception error: no function clause matching gb_trees:size({-1,{[]}})

See – the clause doesn’t match and an exception is thrown.  Think of the comparable C++ code.  It would look something like:

int size(Tree *tree) {
    if (tree == null) {
        throw ...;
    }
 
    if(tree->size < 0) {
        throw ...;
    }
 
    return tree->size;
}

I'm Loving It

How is the tree modified?

I think gb_trees:update is the most approachable example of this.  Besides being crazy small it is a very clear example of how the tree is walked to search for the key and how error handling is meant to be done.

%%-spec(update/3 :: (_, _, gb_tree()) -> gb_tree()).
 
update(Key,  Val, {S, T}) ->
    T1 = update_1(Key, Val, T),
    {S, T1}.
 
%% See `lookup' for notes on the term comparison order.
 
update_1(Key, Value, {Key1, V, Smaller, Bigger}) when Key < Key1 -> 
    {Key1, V, update_1(Key, Value, Smaller), Bigger};
update_1(Key, Value, {Key1, V, Smaller, Bigger}) when Key > Key1 ->
    {Key1, V, Smaller, update_1(Key, Value, Bigger)};
update_1(Key, Value, {_, _, Smaller, Bigger}) ->
    {Key, Value, Smaller, Bigger}.

Notice first that since update is never changing the size it does not bother to pass it to update_1 – only the “node” portion of the tuple is passed.

Once passed there are four possible options – three of which need to be codified (in the same order as the code):

  1. The provided key is smaller than the current tree node.  In this case build up a new result and call update_1 with update_1(Key, Value, Smaller) – “walk left”
  2. The provided key is larger than the current tree node.  Walk right.
  3. The provided key is exactly the current node key (not too big … not to small … just right!).  In this case build up a new node with the new value.
  4. (implicit) The provided key is not in the tree.  Throw an exception.

In the case where the node is not found update_1 ends up being called like:

update_1(Key, Value, nil)

Since the match on #1-3 require a non-nil tuple this turns into a missing function clause error.

Can processes share the same tree?

No.  No they can’t.  I was having a little trouble groking this until I ran through some samples in erl.  This is what made it clear for me:


1> Tree = gb_trees:empty().
{0,nil}
2> Tree2 = gb_trees:insert("Foo", "Bar", Tree).
{1,{"Foo","Bar",nil,nil}}
3> Tree3 = gb_trees:update("Foo", "NewBar", Tree2).
{1,{"Foo","NewBar",nil,nil}}
4> Tree4 = gb_trees:update("Missing", "value", Tree3).
** exception error: no function clause matching gb_trees:update_1("Missing","value",nil)
in function gb_trees:update_1/3
in call from gb_trees:update/3

Each time anything was done a new tree was created – the old tree was still valid and could be used but it was not the new tree.  It’s the old tree.  This boggled me for a few minutes until Justin Sheehy confirmed what the code was saying.

@bubbafat They are immutable data structures. In functional programming, "update" is nondestructive

So no – multiple processes could not possibly share the tree because the tree is immutable.

Want proof?  Go back to erl and run gb_trees:get on the key “Foo” on Tree, Tree2 and Tree3.  Tree will error (it’s an empty tree).  Tree2 will find “Bar” and Tree3 will find “NewBar”.


14> gb_trees:get("Foo", Tree).
** exception error: no function clause matching gb_trees:get_1("Foo",nil)
15> gb_trees:get("Foo", Tree2).
"Bar"
16> gb_trees:get("Foo", Tree3).
"NewBar"

Aha!

Share and Enjoy:
  • Print this article!
  • Twitter
  • Digg
  • Reddit
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • StumbleUpon
No
Comments

Tags:

June 29th, 2009

In my previous post I wrote a simple file system walker using Erlang for the purposes of getting my head around the syntax.  To start thinking more in an Erlang mindset (though in a very contrived manner) in this post I convert the file system walker to use two processes.  The first process (”walk”) performs the file system walking and the second (”visit”) processes each found item (i.e. prints the full item path name).

Something along the lines of this…

walk-visit

Since walk and visit are the process entry points they need to be in the export list (by the way – forgetting to do this does not cause a compiler error in erl – why is that?) and we need a new entry function to create the new processes (start/1).

start(Path) ->
	Visit_PID = spawn(walkproc, visit, []),
	spawn(walkproc, walk, [Path, Visit_PID]).

On line 2 we create the process whose pid is assigned to Visit_PID. This process calls the visit() function with 0 parameters. At this point visit is running and waiting to receive a message.

On line 3 we spawn off another process. This process calls the walk/2 function as walk(Path, Visit_PID). Since the walk function is doing the file system walking it makes sense that it will be the one sending the first message. Because of this it needs to know the PID of the process to send the message to.

visit is a very straight forward function. It receives a message, processes it, sends a response and recurses (rinse and repeat).

visit() ->
	receive
		{Path, Walk_PID} ->
			io:format("~s~n", [Path]),
			Walk_PID ! next,
			visit()
	end.

walk is very similar (and has not changed substantially from our previous version. it starts by firing a message off to visit with the Path passed as a parameter. Next it waits for the “next” message. Once it receives that it gets the next file system entry and recurses.

walk(Path, Visit_PID) ->
	Visit_PID ! {Path, self()},
	receive
		next ->
			FileType = file_type(Path),
			case FileType of
				file ->
					ok;
				symlink ->
					ok;
				directory ->
					Children = filelib:wildcard(Path ++ "/*"),
					lists:foreach(fun(P) -> walk(P, Visit_PID) end, Children)
			end
	end.

The other methods (file_type and is_symlink) have not changed.

I enjoyed how easy it was to convert to a multi-process approach and am looking forward to moving to a solution that uses RabbitMQ.

Share and Enjoy:
  • Print this article!
  • Twitter
  • Digg
  • Reddit
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • StumbleUpon
No
Comments

Tags:

June 29th, 2009

I’m learning Erlang.  I’ll get into “why” in some other post – the purpose here is to share my first sample program and solicit feedback.  The purpose of the program is to start print the contents of a file system from the indicated point downwards (ignoring symlinks).

The application has a single module, walker, which exports walk/1. The argument to walk/1 is the starting path. For example:

> walker:walk("/home").

This method prints the path name, determines the type of the current path (file, directory or symlink), and then IFF the path is a directory it calls filelib:wildcard to get the children of the path and repeats the process on them.

-module(walker).
 
-include_lib("kernel/include/file.hrl").
 
-export([walk/1]).
 
is_symlink(Path) ->
	case file:read_link_info(Path) of
		{ok, #file_info{type = symlink}} ->
			true;
		_ ->
			false
	end.
 
file_type(Path) ->
	IsRegular = filelib:is_regular(Path),
	case IsRegular of
		true ->
			file;
		false ->
			case is_symlink(Path) of
				true ->
					symlink;
				false ->
					directory
			end
	end.
 
walk(Path) ->
	io:format("~s~n", [Path]),
 
	FileType = file_type(Path),
	case FileType of
		file ->
			ok;
		symlink ->
			ok;
		directory ->
			Children = filelib:wildcard(Path ++ "/*"),
			lists:foreach(fun(P) -> walk(P) end, Children)
	end.

My questions about this module are:

  1. Does calling walk(P) in a foreach prevent tail recursion optimizations?
  2. Where I have “case FileType of” (in walk/1) is there a more succinct way to express that?
  3. Why doesn’t read_file_info ever return file_info#type==symlink?
  4. How should this have really been done?

I’ll be working on answering #1-3 on my way to learning #4 – but if you have any feedback I would love to hear it.

The next step is to make this message based and have the walk/1 method send messages to a consumer who will do the printing.

The next-next step is to RabbitMQ and setup one producer and three consumers – one for files, one for directories and one for symlinks. The walk/1 method will no longer print the file info but rather send the appropriate message and let the consumers print the messages.

Share and Enjoy:
  • Print this article!
  • Twitter
  • Digg
  • Reddit
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • StumbleUpon
1
Comment

Tags: