Writing a URL Shortening Service in Erlang, Part 4

June 29th, 2009

In this post, we will add an HTTP interface to sherl using webmachine. The goal is a simple API that will shorten long URLs and redirect short URLs to their original page. Sending a GET to /http://long.it/is/your/url/here will create a short URL and display it. This will make the service easy to use by hand, as a user can simply prepend “http://sherl.com/” to an address in their browser’s address bar. All GET requests that don’t match “/http:” will be interpreted as short URLs and will redirect (HTTP 301) to the original page or give a 404 error if the specified short code was not found in the sherl database.

The first step is to obtain the latest webmachine code. You can find a tarball under the get source link on webmachine’s bitbucket page, or you can install Mercurial and pull down the latest bits. On OS X, here’s what I did:

sudo port install mercurial
hg clone https://userprimary@bitbucket.org/justin/webmachine/
cd webmachine
make

Next, create a skeleton webmachine project using the new_webmachine.erl script.

$PATH_TO_WEBMACHINE/scripts/new_webmachine.erl sherlweb .
cd sherlweb
make
./start-dev.sh

At this point, you should have a working webmachine application running at
http://localhost:8000/ that will return a Hello, new world message in a browser. Webmachine maps resources to erlang modules in which you define callback functions to implement different parts of HTTP. Webmachine brings the protocol back to web application frameworks and makes it easy to create a REST API that behaves according to the HTTP specification. A nice overview of how webmachine works is provided on the wiki. On that page is a link to an HTTP flowchart that gives a visual explanation to webmachine’s approach. The flowchart presents HTTP as a sequence of decision points covering all of the nooks and crannies of the protocol that you’d probably prefer to ignore. A webmachine resource can define a function for each decision point to control how the request should flow through the diagram. Fortunately, sane default functions are available for nearly all of the decision points so you only need to implement the parts of the protocol that are relevant to your application.

There are three significant benefits to the approach taken by the webmachine framework. First, it is much easier to end up with a web application that actually behaves according to the HTTP spec. This makes your application easier for clients to make use of and allows common pieces of web infrastructure like caches and proxies to play nice with your service. Secondly, by breaking down a resource into side-effect free functions, you end up with code that is easy maintain and easy to test. Finally, webmachine comes with a very clever visual debugger that displays an interactive version of the HTTP flowchart and allows you to inspect the request and decision point function output at each step. In this post we’ll just be scratching the surface of webmachine’s features, but you have to start somewhere.

We’ll begin customizing the skeleton by defining the resources in the sherlweb/priv/dispatch.conf file. This file maps URLs to resource modules and is similar to the routes configuration in a Ruby on Rails application. The dispatch configuration consists of a list of resource definitions. Each definition is a three element tuple: a pattern describing the URLs to match, the name of a module, and initialization parameters to send to the module.

%% dispatch.conf
{["http:", '*'], shorten_resource, []}.
{['*'], lookup_resource, []}.

The first directive will match resources that start with “/http:” and dispatch them to the shorten_resource module. The second directive matches any resource and will dispatch to the lookup_resource module. Matching is done in order with the first match winning so the second definition acts as a catch-all.

Now we’ll implement minimal versions of shorten_resource and lookup_resource. You can use the default sherlweb_resource module as a template and replace the to_html function as follows:

%% shorten_resource.erl
%% module boilerplate excluded
 
to_html(ReqData, State) ->
    "/" ++ LongUrl = wrq:raw_path(ReqData),
    ShortUrl = make_short_url(LongUrl, ReqData),
    {["<html><body><a href=\"", ShortUrl, "\">", ShortUrl,
      "</a></body></html>"],
     ReqData, State}.
 
make_short_url(LongUrl, ReqData) ->
    {ok, Code} = sherl:encode(LongUrl),
    Host = wrq:get_req_header("host", ReqData),
    "http://" ++ Host ++ "/" ++ Code.
%% lookup_resource.erl
%% module boilerplate excluded
 
to_html(ReqData, State) ->
    {ok, LongUrl} = sherl:decode(wrq:disp_path(ReqData)),
    {["<html><body><a href=\"", LongUrl, "\">", LongUrl,
      "</a></body></html>"],
     ReqData, State}.

In addition, we need to update sherlweb.app by removing sherlweb_resource and adding the above two modules (also delete sherlweb_resource.erl from the src directory). Then modify sherlweb.erl so that the sherl application implemented in Part 3 will be started as part of the web app initialization. The modified sherlweb:start/0 looks like this:

%% from sherlweb.erl
start() ->
    sherlweb_deps:ensure(),
    ensure_started(crypto),
    ensure_started(webmachine),
    ensure_started(mnesia),
    ensure_started(sherl),
    application:start(sherlweb).

At this point, rebuilding and restarting the project should allow you to test the basic API and see a very rudimentary HTML-based shortening service. To refine the service, we will iterate on the lookup_resource module so that it will return a 301 redirect for a valid short URL and a 404 otherwise. To do that we will implement the resource_exists, previously_existed, and moved_permanently callbacks. The new version of lookup_resource.erl looks like this:

-module(lookup_resource).
-export([init/1, moved_permanently/2, previously_existed/2,
         resource_exists/2]).
 
-include_lib("webmachine/include/webmachine.hrl").
 
-record(sherl, {long_url = undefined}).
 
init([]) -> {ok, #sherl{}}.
 
resource_exists(ReqData, State) ->
    {Status, LongUrl} = sherl:decode(wrq:disp_path(ReqData)),
    case Status of
        ok -> {false, ReqData, State#sherl{long_url = LongUrl}};
        _ -> {false, ReqData, State}
    end.
 
previously_existed(ReqData, State) ->
    case State#sherl.long_url of 
        undefined ->
            {false, ReqData, State};
        _ ->
            {true, ReqData, State}
    end.
 
moved_permanently(ReqData, State) ->
    case State#sherl.long_url of 
        undefined ->
            {false, ReqData, State};
        LongUrl ->
            {{true, LongUrl}, ReqData, State}
    end.

Based on the flowchart, the resource_exists function will be called first. It looks up the short code using sherl:decode/1 and updates the resource state record with the long URL if it is found. The resource_exists function always returns false since we never want to return content from this resource. We could optimize this a bit and return 404 without further decision function processing in the case that we don’t have a long URL for the requested short code. Implementing this short circuit is left as an exercise for the reader. Next the flow will call previously_existed which returns true if we found a long URL matching the requested short code and false otherwise. When previously_existed returns false, the server responds with a 404. When it returns true, the flow calls moved_permanently and webmachine generates a 301 response.

At this point we have a simple, but functioning URL shortening service backed by sherl, an OTP application that usees mnesia for storage, and fronted (is that a word?) by webmachine. This seems like as good a stopping point as any. I’m cleaning up the code and plan on posting it on github when it’s ready.

UPDATE the code for sherl is now available on github:
http://github.com/seth/sherl/tree/master

Writing a URL Shortening Service in Erlang, Part 3

June 20th, 2009

In this installment, we’ll turn the core sherl_db module into a proper OTP application by creating a sherl gen_server, a supervisor, and a an application resource file. While the sherl_db module can already be used as a server (the tests in part 2 show that it can be called concurrently), setting up a gen_server will provide a standardized framework for packaging the code as well as a nice abstraction layer should we decide to replace the mnesia-based sherl_db module with something else.

First we will tackle the gen_server part of the show. The erlang-mode for Emacs provides a very useful template for creating new gen_servers; even if you don’t use Emacs, you might want to steal the template idea as it proves quite useful. The gen_server will live in a module called simply sherl and in addition to the required gen_server behavior callbacks, exports the following API:

-export([start_link/0, stop/0, encode/1, decode/1]).

A client will call sherl:encode to turn a URL into a code that we will use as a short URL. The code we will use is the base 62 encoding of an integer that the sherl_db module assigns to the given URL. We choose base 62 as it allows us to represent short URL codes using A-Z, a-z, and 0-9 which will keep our short URLs very clean and easy to parse — and in particular, will reserve “.” for use as a format extension so that web users can append “.json” or “.xml” to retrieve the short URL meta data in a desired format. Below is the definition for the main sherl public API:

start_link() ->
    gen_server:start_link({local, ?SERVER}, ?MODULE, [], []).
 
stop() ->
  gen_server:call(?SERVER, stop).
 
encode(Url) ->
    gen_server:call(?SERVER, {encode, Url}).
 
decode(Code) ->
    gen_server:call(?SERVER, {decode, Code}).

While you don’t have to define a wrapper API for a gen_server, it makes using it that much more convenient. With the above API, client code can simply call sherl:encode(Url) instead of gen_server:call(sherl, {encode, Url}).

As you can see, each API call maps to a message sent as a tuple with first element describing the desired action. The server will read the message and dispatch to the appropriate function to respond to the client’s request.

Before going further with the API, a short diversion to implement the init callback. Here it is:

init([]) ->
    sherl_db:start([]),
    {ok, #state{}}.

Next up is implementing the handle_call callback. The is where the real work of the server gets done. In implementing the callback code, one realizes that what is really going on is message passing. The client sends the sherl gen_server a message and the sherl server examines the message to determine how to respond. Here’s what we need to support the API defined above:

handle_call({decode, Code}, _From, State) ->
    case sherl_db:get_url(base62:decode(Code)) of
        undefined ->
            {reply, {not_found, Code}, State};
        UrlRec ->
            {reply, {ok, UrlRec#url.url}, State}
    end;
handle_call({encode, Url}, _From, State) ->
    Rec = sherl_db:get_code(Url),
    Code = base62:encode(Rec#url.code),
    {reply, {ok, Code}, State};
handle_call(stop, _From, State) ->
    sherl_db:stop(),
    {stop, normal, stopped, State};
handle_call(_Request, _From, State) ->
    {reply, ignored, State}.

Our server doesn’t maintain any state between calls, but if it did, we would thread the state data through all the callbacks using the State variable that you see passed in as an argument and returned by all of the callbacks. Ok, takes care of the sherl gen_server. Now we need a supervisor and an application module. As these follow fairly mechanically from their required callbacks, I will simply list them below.

-module(sherl_sup).
-behaviour(supervisor).
-export([start_link/0]).
-export([init/1]).
-define(SERVER, ?MODULE).
 
start_link() ->
    supervisor:start_link({local, ?SERVER}, ?MODULE, []).
 
init([]) ->
    Sherl = {sherl_main, {sherl, start_link, []},
              permanent, 2000, worker,[sherl]},
    {ok, {{one_for_one, 20, 60}, [Sherl]}}.
-module(sherl_app).
-behaviour(application).
-export([start/2, stop/1]).
 
start(_Type, _StartArgs) ->
    case sherl_sup:start_link() of
        {ok, Pid} -> 
            {ok, Pid};
        Error ->
            Error
    end.
 
stop(_State) ->
    ok.

Finally, we need an application resource file (sherl.app) and that goes like this:

%% -*- mode: erlang -*-
{application, sherl, 
 [{description, "sherl URL shortening core service"}, 
  {vsn, "0.1"}, 
  {modules, [base62, sherl, sherl_app, sherl_db, sherl_sup]}, 
  {registered, [sherl, sherl_sup]}, 
  {applications, [kernel, stdlib, mnesia]}, 
  {mod, {sherl_app, []}}, 
  {start_phases, []} 
 ]}.

I put sherl.app in the src directory along with the *.erl files and have the build process copy it to ebin/sherl.app since application resource files need to be on your code path for the application module to find them.

We now have an OTP application. Here’s some example usage from the Erlang shell.

Eshell V5.7.1  (abort with ^G)
1> application:start(mnesia).
ok
2> application:start(sherl).
ok
3> sherl:encode("http://userprimary.net/user/2009/06/20/writing-a-url-shortening-service-in-erlang-part-3/").
{ok,"6"}
4> sherl:decode("6").
{ok,"http://userprimary.net/user/2009/06/20/writing-a-url-shortening-service-in-erlang-part-3/"}
5> sherl:decode("abc").
{not_found,"abc"}
6> gen_server:call(sherl, blah).
ignored
7> application:stop(sherl).

=INFO REPORT==== 20-Jun-2009::22:19:02 ===
    application: sherl
    exited: stopped
    type: temporary
ok
8>

In the sherl.app resource file, we listed mnesia as an application that must be running for sherl to start. The application framework will verify that mnesia is running and error out before starting sherl if mnesia is not running. It will not, however, start mnesia for you which is why, in the demo above, I had to explicitly start mnesia before starting sherl — even though the sherl_db start function will start mnesia if needed.

So far we’ve written the core database module, sherl_db, tested it using Common Test, and packaged it into an OTP application. In the next installment, we’ll add the HTTP interface for the sherl URL shortening service using the webmachine framework.

UPDATE I realized that the initialization of sherl isn’t quite right when starting for the first time. After posting a question to the Erlang questions mailing list, I came up with the following adjustment to sherl_db.erl based on the helpful replies I received:

start(Nodes) ->
    case is_fresh_startup() of
        true ->
            case mnesia:system_info(is_running) of
                yes ->
                    error_logger:info_report("stopping mnesia"),
                    mnesia:stop();
                _ -> pass
            end,
            mnesia:create_schema(Nodes),
            error_logger:info_report("mnesia schema created"),
            error_logger:info_report("starting mnesia"),
            mnesia:start(),
            mnesia:create_table(url, [{disc_copies, Nodes},
                                      {attributes, record_info(fields, url)},
                                      {index, [url]}]),
            mnesia:create_table(counter, [{disc_copies, Nodes},
                                          {attributes,
                                           record_info(fields, counter)}]),
            error_logger:info_report("mnesia tables created");
        {exists, Tables} ->
            ok = mnesia:wait_for_tables(Tables, 20000)
    end.
 
%% @spec is_fresh_startup() -> true | false
%% @doc Returns true if mnesia has not been initialized with
%% the sherl schema.
%% Thanks to Dale Harvey for this function posted to
%% the erlang questions mailing list.
is_fresh_startup() ->
    Node = node(),
    case mnesia:system_info(tables) of
        [schema] -> true;
        Tbls ->
            case mnesia:table_info(schema, cookie) of
                {_, Node} -> {exists, Tbls};
                _                 -> true
            end
    end.

The issue was that since mnesia is listed as a required application in sherl.app, it needs to be started before the sherl application can start. But once mnesia is started, you cannot (easily) create a schema. The approach taken above is to detect this fresh start situation and stop mnesia in order to initialize the desired schema. Other approaches, which perhaps are more appropriate for a production environment, that were suggested on the list include using the builder module which will create a load script that could be customized or to make use of mnesia:change_config (although I believe this requires a separate running mnesia).

Writing a URL Shortening Service in Erlang, Part 2

June 18th, 2009

In part 1, we described our first pass at a module that interfaces to Mnesia to provide the core storage API for sherl, our URL shortening service.

Our code, at present, consists of a single file, sherl_db.erl. Since the url record will be useful to modules that call sherl_db, I extracted the definition into a separate url.hrl file and then setup the files to follow the OTP directory structure conventions. Here’s what we have:

sherl/
|-- Makefile
|-- ebin/
|-- include/
|   `-- url.hrl
|-- src/
|   `-- sherl_db.erl
`-- test/

We are going to use Common Test as a testing framework. Common Test comes with recent versions of Erlang, however, you will need to take an extra step to install the run_test script. Here’s what I did:

cd /usr/local/lib/erlang/lib/common_test-1.4.1
sudo ./install.sh local
sudo ln -s /usr/local/lib/erlang/lib/common_test-1.4.1/priv/bin/run_test \
           /usr/local/bin/run_test

Test files go in the test directory of the application and the Common Test (ct) convention is to create a file test/foo_SUITE.erl if you have a module named foo in your application. Tests in ct are just erlang functions that can do whatever you want. A successful test function is one that doesn’t blow up. Common Test was not designed as a unit testing framework, but you can use it as one. There is not, however, a collection of assert functions as one might expect when thinking of a traditional xUnit environment. If that is what you are looking for, have a look at EUnit. Personally, I don’t miss the assert functions as Erlang’s pattern matching makes for a powerful and light weight assertion mechanism.

Below is a minimal sherl_db_SUITE.erl. Some key features of a ct suite file are:

  • You must include ct.hrl.
  • Your life will be easier if you specify export_all.
  • You can define suite-wide test configuration in the suite() function. Below I’ve set a time limit of ten seconds for any test.
  • Setup and cleanup functions for the entire suite can be defined using init_per_suite and end_per_suite.
  • Tests are executed in the order specified by the all function unless you specify otherwise. This can be useful to orchestrate test scenarios that depend on sequence. You can also configure ct to run tests in random order, to repeat tests, and more.
-module(sherl_db_SUITE).
 
-compile(export_all).
-include("../include/url.hrl").
-include_lib("ct.hrl").
 
suite() -> [{timetrap,{seconds,10}}].
 
init_per_suite(Config) ->
    sherl_db:start([]),
    Config.
 
end_per_suite(_Config) ->
    sherl_db:stop(),
    ok.
 
all() -> 
    [unknown_url_is_undefined, roundtrip1, roundtrip2].
 
%% Test cases start here.
 
unknown_url_is_undefined(_Config) ->
    undefined = sherl_db:get_url("no-such-url-in-db"),
    ok.
 
roundtrip1(_Config) ->
    Url1 = "url1",
    Ans1 = sherl_db:get_code(Url1),
    Url1 = Ans1#url.url,
    1 = Ans1#url.code,
    Ans1 = sherl_db:get_url(Ans1#url.code),
 
    %% same URL yields same record
    Ans1 = sherl_db:get_code(Url1),
    ok.
 
roundtrip2(_Config) ->
    Url2 = "url2",
    Ans2 = sherl_db:get_code(Url2),
    Url2 = Ans2#url.url,
    2 = Ans2#url.code,
    Ans2 = sherl_db:get_url(Ans2#url.code),
    ok.

Here’s what this looks like on the shell for a successful run:

orange sherl (master)$ make test
run_test -pa /Users/seth/Dropbox/SideProj/sherl/lib/sherl/ebin -dir /Users/seth/Dropbox/SideProj/sherl/lib/sherl -logdir ct-results
Erlang R13B (erts-5.7.1) [source] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
 
 
Common Test starting (cwd is /Users/seth/Dropbox/SideProj/sherl/lib/sherl)
 
Eshell V5.7.1  (abort with ^G)
(ct@orange)1> 
Common Test: Running make in test directories...
 
CWD set to: "/Users/seth/Dropbox/SideProj/sherl/lib/sherl/ct-results/ct_run.ct@orange.2009-06-18_21.49.33"
 
TEST INFO: 1 test(s), 6 case(s) in 2 suite(s)
 
Testing lib.sherl: Starting test, 6 test cases
 
=INFO REPORT==== 18-Jun-2009::21:49:35 ===
    application: mnesia
    exited: stopped
    type: temporary
Testing lib.sherl: TEST COMPLETE, 6 ok, 0 failed of 6 test cases
 
Updating /Users/seth/Dropbox/SideProj/sherl/lib/sherl/ct-results/index.html... done
Updating /Users/seth/Dropbox/SideProj/sherl/lib/sherl/ct-results/all_runs.html... done

I would like this setup much better if there was a way to have the console output be much more concise. I find the output as it is a bit difficult to read. The HTML reports that ct creates harken back to ye olde times in terms of the web design, but nevertheless provide nice summary and detail views of the tests. Another interesting feature of ct is that it saves a separate report for each run so you can review progress by examining test result history.

ct summary page

ct summary page

Now back to our regularly scheduled program. We’ve put in place a few very simple tests to verify that sherl_db:get_code and sherl_db:get_url work properly. One thing we haven’t tested is what happens when requests come in concurrently. We’ll define a test concurrent_creating that will spawn a number of processes each of which will call get_code for the same list of URLs.

concurrent_creating(_Config) ->
    NumClients = 5,
    Seq = lists:map(fun erlang:integer_to_list/1, lists:seq(1, 10)),
    Parent = self(),
    F = fun() ->
                Codes = lists:map(fun(N) ->
                                          sherl_db:get_code("http://" ++ N)
                                  end,
                                  Seq),
                Parent ! {self(), Codes}
        end,
    Pids = lists:map(fun(_X) -> spawn(F) end, lists:seq(1, NumClients)),
    Results = [ simple_gather(Pid) || Pid <- Pids ],
    Codes = [ X#url.code || X <- hd(Results) ],
    ExpectedCodes = lists:seq(3, 12),
    lists:foreach(fun(L) ->
                          ExpectedCodes = [X#url.code || X <- L]
                  end,
                  Results),
    ok.
 
simple_gather(Pid) ->
    receive
        {Pid, Val} ->
            Val
    end.

In the test, we define a fun, F, that uses lists:map to obtain the code for each URL in Seq and then sends the result to the parent process. The parent process spawns a process that runs F for each pseudo-client and then gathers the result by waiting for messages with the expected PIDs. The result we expect is that each client obtains the same codes for the URLs and that those codes should be in order. The ordering is perhaps debateable, but I’ve left it in for now.

Running this test the first time was very satisfying because it completely failed.

Testing lib.sherl: Starting test, 6 test cases
 
=ERROR REPORT==== 18-Jun-2009::22:07:15 ===
Error in process <0.158.0> on node 'ct@orange' with exit value: {{badmatch,{aborted,{{case_clause,[{url,21,"http://9",{1245,388035,266231}},{url,20,"http://9",{1245,388035,266190}},{url,19,"http://9",{1245,388035,266148}},{url,18,"http://9",{1245,388035,266103}}]},[{sherl_db,'-get_code/1-fun-0-',1},{mnesia_tm,apply_fun,3},{mnesia_tm,execute_transaction,5},{sherl_db... 
 
- - - - - - - - - - - - - - - - - - - - - - - - - -
sherl_db_SUITE:simple_gather failed on line 101
Reason: timetrap_timeout
- - - - - - - - - - - - - - - - - - - - - - - - - -
 
Testing lib.sherl: *** FAILED *** test case 6 of 6
Testing lib.sherl: TEST COMPLETE, 5 ok, 1 failed of 6 test cases

The HTML report from ct contains more detail. One nice feature is that it shows you where the error occured and gives you a link to a display of the source code with numbered lines. I was going to include some more screenshots, but my screen camera is out of film.

So what happened? Summary: I misunderstood mnesia transactions. Transactions in mnesia can run concurrently and the developer needs to add explicit read and/or write locking where appropriate. My initial mental model of transactions in mnesia imagined transactions as occurring serially. This isn’t the case, and for good reason. Multiple read-only transactions should obviously be able to run concurrently in a reasonable system.

Within a transaction, either a request to obtain a lock will succeed, in which case you have the lock for the remainder of the transaction (there does not seem to be any notion of explicit unlocking) or you fail to get the lock and the entire transaction is backed out and will be retried. Because transactions may be replayed a number of times, you need to be very careful that code within a transaction is free of side-effects else repeated executions may have undesired effects.

The first fix for the code, then, is to ask for a write lock for the table. Adding mnesia:lock({table, url}, write) to the top of the transaction code in get_code made the test pass. Thanks to some discussion on the erlang questions mailing list, I was made aware of another problem. With the lock in place, we will indeed get the expected behavior that only one request will create a new record for a given URL. However, if one of the transactions fails because it is unable to get a lock, then it may happen that next_int is called more often than we expect and we will end up skipping integer values in the codes we assign. This would happen if the call to mnesia:write bombs out. But since the write lock is acquired before next_int is called, I would expect that a failed lock would bail out of the transaction without executing all of the code. So this needs some further investigation.

To wrap up here are links to the complete sherl_db.erl and sherl_db_SUITE.erl files. Aside from the above bug fix, I’ve also made a few small changes to sherl_db to use mnesia:index_read and mnesia:read instead of qlc.

Writing a URL Shortening Service in Erlang, Part 1

June 15th, 2009

As a means of exploring Erlang, I decided to write a URL shortening service similar to the services you are likely familiar with like tinyurl, bit.ly, ur.ly, and tr.im. A URL shortener makes for a small, well-defined learning project that touches some important areas that are likely to come up in any real-world system, namely database storage and HTTP web services. This project uses the Mnesia database for storage and a light weight framework called webmachine, which sits on top of mochiweb, for building the RESTful HTTP API. In this post, I’ll describe the module that interfaces with mnesia.

At its core, our URL shortening service needs to map a URL to an integer, keep track of this mapping, and be able to return the original URL when given the integer code at a later point in time. The integer codes will be turned into short URLs by encoding them to make them more compact. I’ll describe how I went about that in another post. With the core storage system in place, the web layer will provide the user interface for creating short URLs and translating them back to their original long form.

One last thing before diving in. Surely, the project needs a name; let’s call it sherl.

The sherl database module needs to support two operations:

  1. get_code(Url): Given a URL, create or look up of the integer code.
  2. get_url(Code): Given an integer code, look up the original URL or indicate that the code is unknown.

For each URL we need to store the integer code and the URL. As a bonus, we will also store the creation time. Mnesia makes it easy to associate an Erlang record with an Mnesia table so we’ll start with the following definition:

-record(url, {code, url, created}).

Mnesia’s schema definition may seem surprisingly relaxed if one is used to relational database system. Essentially, you define a table and specify how many columns and their names and leave it at that. You can store any Erlang term in a given table cell.

Before we can start implementing either get_code or get_url, we need some supporting code to create the mnesia schema, start and stop mnesia, and we need the ability to generate an increasing sequence of integers (an auto increment feature).

-module(sherl_db).
-export([get_code/1, get_url/1, start/1, stop/0]).
-include_lib("stdlib/include/qlc.hrl").
 
-record(url, {code, url, created}).
-record(counter, {id = 0, ver = 1}).
 
start([]) ->
    start([node()]);
start(Nodes) ->
    mnesia:create_schema(Nodes),
    mnesia:start(),
    mnesia:create_table(url, [{disc_copies, Nodes},
                              {attributes, record_info(fields, url)},
                              {index, [url]}]),
    mnesia:create_table(counter, [{disc_copies, Nodes},
                                  {attributes, record_info(fields, counter)}]),
    mnesia:wait_for_tables([url, counter], 20000).
 
stop() ->
    mnesia:stop().
 
get_code(Url) -> implement_me.
 
get_url(Code) -> implement_me.
 
%% @spec next_int() -> integer()
%% @doc This is the integer sequence used to uniquely identify URLs
next_int() ->
    mnesia:dirty_update_counter(counter, id, 1).

Let’s start with start/1 that begins by creating the mnesia schema. The mnesia:create_schema/1 function will create a directory on disk and store information about the Erlang nodes that are a part of the database. Note that you call it before starting mnesia and that we do not give it any specifics about our data format. Mnesia’s schema are quite different from how the term is used in RDB systems. You control the directory where mnesia writes its schema information, as well as where local disk copies of tables will be stored by specifying a flag at Erlang start time:

-mnesia dir '"/your/dir/here"'

I haven’t found a way to set this programatically. For most production systems I don’t imagine that poses any problem, but for testing a system that uses mnesia it would be convenient to be able to control where mnesia looks for data. So if you know of a way to do this, please comment on this post.

Next we call create_table to define our tables, one for the url record and one for our integer sequence, more on that below. If the tables already exist, these calls will return a tagged tuple indicating an error, but no harm is done. Finally, we call wait_for_tables because if the tables do already exist and we are using a multi-node mnesia schema, we want to be sure the latest data is available on all nodes before allowing the API of this module to be called.

Creating the integer sequence required some searching, but ends up being very simple. I came upon this blog post that had a comment from Ulf Wiger which pointed me to the dirty_update_counter/3 function. The counter table will only contain a single row and will get updated to keep track of the integer sequence. One minor pitfall to be aware of is that mnesia tables must have at least two columns, hence the “ver” column the counter record definition.

What? Your still here? Alright, then. On to the main API. Let’s look at get_url first because it is simpler. I decided that both get_url and get_code should return a url record as this will keep all of the pertinent information together for clients.

%% @spec get_url(integer()) -> recUrl() | undefined
%% @type recUrl() = #url
%% @doc Return the the url record for the URL associated with the
%% specified integer ID.  The atom undefined is returned if the
%% specified URL is not found in the database.
get_url(Code) ->
    F = fun() ->
                Q = qlc:q([X || X <- mnesia:table(url), X#url.code =:= Code ]),
                case qlc:e(Q) of
                    []    -> undefined;
                    [Rec] -> Rec
                end
        end,
    {atomic, Val} = mnesia:transaction(F),
    Val.

The get_url function uses the qlc module’s list comprehension syntax to query for the url record with a matching code. For a gentle introduction to mnesia and performing queries using qlc, I highly recommend Programming Erlang.

Finally, we can wrap up this episode with get_code:

%% @spec get_code(string()) -> recUrl()
%% @type recUrl() = #url
%% @doc Store a URL in the database and return a url record with the
%% unique integer that will be permanently associated with the URL.
%% If the URL is already in the system, the existing url record is
%% returned.
get_code(Url) ->
    F = fun() ->
                Q = qlc:q([X || X <- mnesia:table(url), X#url.url =:= Url ]),
                Rec = qlc:e(Q),
                case Rec of
                    [] ->
                        Code = next_int(),
                        New = #url{url = Url, code = Code, created = now()},
                        mnesia:write(New),
                        New;
                    [Found] ->
                        Found
                end
        end,
    {atomic, Val} = mnesia:transaction(F),
    Val.

The get_code function requires a bit more logic since we need to first check whether we’ve already assigned an integer code for the specified URL, in which case we just return the record we find. Otherwise, we obtain an integer code and write a new record to the database. The check for existing and the possible write take place within the same transaction. This should avoid the potential race condition of the same long URL being requested concurrently. Note to self: it would be nice to come up with a way to test that.

That concludes part 1. I realize that I haven’t yet discussed the layout of the code on disk, how to build the code, nor how to test it. I will try to work that into one of the follow up posts.

[Update: checkout part 2 which has among other things, an important bug fix to the code presented here]

CouchDB for access log aggregation and analysis

June 13th, 2009

I spent this week exploring CouchDB as a backing store for a data aggregation service that we are building at work. The service needs to analyze web server access logs and provide aggregated views of the data over different time windows. For the purposes of this discussion, the analysis goal is to identify the most viewed pages. As each page has associated with it a list of tags, we also wish to rank the tags based on the page views.

There are many capable packages that will parse, analyze, and report on web server access logs and we are certainly not trying to reinvent the wheel. However, the specifics of our data and they way in which we wish to serve the results make the various off-the-shelf analysis packages inappropriate.

CouchDB is compelling because:

  • the document-based paradigm coupled with the map/reduce powered views makes experimenting with the data cheap and gives one hope that the system will support as yet to be defined requirements;
  • we can process data and then serve it up via HTTP for other services to consume using the same system in which the data is stored;
  • couchdb has a trigger-based replication mechanism that could be used in setting up a highly available service
  • If you take all the pillows off, you can make an awesome fort

After reading through How to handle stats aggregation on the CouchDB wiki, I decided to process the raw log data into one minute summaries rather than creating a document in couchdb for every log entry. The first step was to gather some archived access log data and split it into files representing one minute worth of requests. Then I wrote a script to summarize each one minute chunk into a list of unique pages and counts that could be inserted as documents into couchdb. In the production system, the idea is that the live access logs would be summarized into one minute chunks and integrated into the database in near real-time. Here’s an example document:

{
       "_id" : "2009-06-13T11h07_70af27180205e9dd37322fdaa92dd60e",
   "request" : "/the/url/for/this/request";
"view_count" : 120;
  "view_pct" : 0.15;
      "tags" : ["foo", "bar", "baz"];
}

I used the couchrest Ruby gem to insert the documents into couchdb. Next it was time to start writing some couchdb views to start looking at the data. I did a lot of experimenting using temporary views entered via the futon web interface. This eased the learning curve considerably because I was able to get rapid feedback on my first attempts to write couchdb map/reduce views.

We want to be able to see the most viewed pages for a specified one minute, one hour, one day, or one month interval. Ideally, I would like to have a view that allowed for more flexibility in the aggregation of time, for example, a view that would allow a user to specify a ten minute or four hour interval; I haven’t found a good solution for that yet.

But if you can live with predefined time intervals, one approach that seems to work is to define a map function that emits a key for each time interval and page ID. In the example below, the document IDs consist of a timestamp, coded like 2009-06-13T11h07, a separating underscore, and then the md5 digest of the page’s URL.

// map function for view count by minute, hour, day, or month
function(doc) {
    var doc_hash = doc._id.replace(/^.+_/, "");
    var date_str = doc._id.replace(/_.+$/, "").replace(/\-/g, "/").
        replace("T", " ").replace("h", ":") + " PDT";
    var dt = new Date(Date.parse(date_str));
    var date_key = [dt.getFullYear(), dt.getMonth() + 1,
                    dt.getDate(), dt.getHours(), dt.getMinutes()];
    var date_keys = {
        "M" : date_key.join("-"),
        "H" : date_key.slice(0, 4).join("-"),
        "D" : date_key.slice(0,3).join("-"),
        "m" : date_key.slice(0,2).join("-")
    };
    for (t in date_keys) {
        emit([t, date_keys[t], doc_hash], doc.view_count);
    }
}
 
// reduce function
function(keys, values, rere) {
    return sum(values);
}

Given a document with ID “2009-06-13T10h31_abc”, the map function will emit keys:

["M", "2009-6-13-10-31", "abc"]
["H", "2009-6-13-10", "abc"]
["D", "2009-6-13", "abc"]
["m", "2009-6", "abc"]

One can then query this view for a specific time interval using the group, startkey, and endkey parameters. For example, to get a summary of page views on June 12, 2009 between 10:00 and 11:00, issue a query with startkey=["H", "2009-6-12-10", true], endkey=["H", "2009-6-12-10", {}] and group=true. Based on my experiments and reading of the view collation rules, this will return all of the hour-based keys that look like “2009-6-12-10″ (the hour between 10 and 11). The reduce function and group=true will sum the view counts for each unique page ID.

What this view does not do is return the results sorted by page views. The lack of an ability to sort results based on a computed value was surprising to me and has me reconsidering whether or not couchdb is appropriate for our needs. For now we can get around this by ask for all results for the view and doing the sorting on the client side. Another missing piece is that CouchDB currently lacks a notion of chaining map/reduce views together. If we could run one more map on the result of this view, we could easily sort by the view_count. The Erlang-based client hovercraft already has support for view chaining by inserting the results of a view into a temporary database. There has also been some recent discussion on the couchdb developer mailing list about possible approaches to support view chaining in general.

Creating a view for the tags follows similarly by emitting a row for each timestamp and tag combination. Another approach would be to create a separate view for each time unit instead of relying on the array-valued keys and prefixes. Some experiments are in order to understand further the pros and cons of a single view vs seprate views. One could even take the one view to rule them all approach further and add the tag rows to the same view by prepending another element into the key array to identify the tag rows.

Overall, I’ve been impressed with CouchDB’s ease of use and the performance appears to be quite usable for our purposes, although I haven’t yet done any formal measurements. I do have some reservations, however, that our use cases differ significantly from those that drive couchdb development.