WSDK Tutorial

WSDK lets you build better quality Web Archiving software in less time.
This tutorial guide will walk you through some basic examples of that process.
We’ll skip the “Hello Web Archiving” example here; that’s way too easy.
Let’s move on to something more interesting.
First, we need some WARC files to play with.
Fortunately, the Internet Archive offers a bunch of them for free at:

Free of charge WARC files
Just grab one:
$ wget http://archive.org/download/testWARCfiles/WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz
c:\> wget.exe http://archive.org/download/testWARCfiles/WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz


1. Count WARC records

In this first gem, we’ll use some of the WSDK’s primitives to count the number of records inside the WARC file we’ve just downloaded.

%% copy the following into a file called: "count_records.erl"
-module(count_records).
-export([from_warc/1]).

from_warc(Filename) ->
  {ok, _, Handle} = wsdk_warc:read(Filename),
  count(Handle, 1).              %% start counting from 1

count(Handle, Cnt) ->
  case wsdk_warc:read(Handle) of %% read next WARC record if any!
    {ok, _, Handle1} ->
       count(Handle1, Cnt + 1);  %% found a new record, increment Cnt
    eof ->
       wsdk_warc:close(Handle),  %% we're done, close the WARC file handle
       Cnt
  end.
Compile:
c:\> erlc count_records.erl
Then run:
c:\> werl -s wsdk start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> count_records:from_warc("WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz").

64188

2> q().

We found 64188 WARC records.

Recap

  • Lines 2,3: declare a new module called count_records which exports one public function called from_warc/1.

  • The function from_warc takes only one argument, a WARC filename.

  • Line 6: open the WARC file named by the variable Filename through a call to wsdk_warc:read/1. The first call to this function takes
    a filename as its argument and already returns the first record, which is why counting starts at 1 (line 7). Subsequent calls take a file handle (line 10).
    In case of success, we get back the tuple: {ok, Record, Handle}
    As you can see, we’re only interested in counting records here. Thus, we used the _ placeholder to ignore the record value:
    {ok, _, Handle}
  • Line 9: the head of the recursive counting loop.

    We call the same function wsdk_warc:read/1 but this time with a file “Handle” (not a filename).
    In case of success, we increment the Cnt variable which holds the number of records found so far.
  • Line 13: eof indicates the end of the WARC. We’re done and there are no more records to count. Close the handle (line 14) and return Cnt (line 15).



10 lines of code ... too much for just counting

These 10 lines are doing much more than just counting WARC records.
Can you spot why?



2. How about plain WARCs?

The previous gem was tested with a GZIP compressed WARC file.
Can we do the same on plain (i.e. uncompressed) WARC files?

Let’s see!

First, make an uncompressed version of this WARC (do not alter the original compressed WARC):
$ gzip -dc WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz > WIDE-20110225183219005.warc
Now, run your program:
$ erl -s wsdk start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> count_records:from_warc("WIDE-20110225183219005.warc").

64188

2> q().

The result is the same: 64188 records found.



Exercise

To keep this gem easy to read, we didn’t handle exceptions while parsing broken/invalid WARC files.
In general, Erlang programmers don’t worry too much about crashes, and most of them adhere to the “let it crash” philosophy.
However, how can we make this code more reliable?
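
One possible direction, as a minimal sketch rather than the WSDK way: wrap each read in try ... of ... catch and report how far the counting got. The module name safe_count and the injected ReadNext fun are hypothetical, introduced only so that the sketch runs without WSDK. ReadNext is assumed to return {ok, Record, NewState} or eof, the same shape wsdk_warc:read/1 has in the gem above. Whether WSDK signals a broken record with an exception or with an error tuple is not stated in this tutorial, so treat the catch clause as an assumption.

```erlang
%% copy the following into a file called: "safe_count.erl"
%% (hypothetical module, not part of WSDK)
-module(safe_count).
-export([count/2]).

%% ReadNext(State) must return {ok, Record, NewState} or eof.
count(ReadNext, State) ->
  count(ReadNext, State, 0).

count(ReadNext, State, Cnt) ->
  try ReadNext(State) of
    {ok, _Record, State1} ->
      count(ReadNext, State1, Cnt + 1); %% calls in the "of" part stay tail calls
    eof ->
      {ok, Cnt}
  catch
    Class:Reason ->
      %% assumption: a broken record raises an exception
      {error, {Class, Reason}, Cnt}     %% report how many records were read so far
  end.
```

With WSDK loaded, safe_count:count(fun wsdk_warc:read/1, Filename) should follow the same path as count_records:from_warc/1, except that it returns {ok, Cnt} and leaves the handle open; closing it was left out to keep the sketch short.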



3. Is it valid?

This third gem will help us answer a simple question:
is the first record in this WARC valid (or not)?

Let’s see how to proceed:

%% copy the following into a file called: "record_status.erl"
-module(record_status).
-export([valid_or_not/1]).

valid_or_not(Filename) ->
  {ok, Record, Handle} = wsdk_warc:read(Filename),
  Status = wsdk_record:is_valid(Record), %% check record's validity
  ok = wsdk_warc:close(Handle),
  Status =:= ok.
Compile:
$ erlc record_status.erl
Then run:
$ erl -s wsdk start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> record_status:valid_or_not("WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz").

true

2> q().

So yes, the first record is a valid WARC record.

Recap

  • Line 6: open the WARC file in variable Filename.
  • Line 7: call wsdk_record:is_valid/1 to ensure both syntactic and semantic validity.
  • Line 9: return the result.



Exercise

This time, try to check whether the last record is also valid.
Hint: use the end of WARC file tag eof (see 1. Count WARC records).
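
To make the hint concrete, here is a hedged sketch of the “remember the previous record until eof” idea. The module last_record and the injected ReadNext fun are hypothetical and exist only so that the code runs without WSDK; ReadNext is assumed to behave like wsdk_warc:read/1, returning {ok, Record, NewState} or eof.

```erlang
%% copy the following into a file called: "last_record.erl"
%% (hypothetical module, not part of WSDK)
-module(last_record).
-export([last/2]).

%% Returns {ok, Record} for the last record read, or empty if there was none.
last(ReadNext, State) ->
  last(ReadNext, State, empty).

last(ReadNext, State, Prev) ->
  case ReadNext(State) of
    {ok, Record, State1} ->
      last(ReadNext, State1, {ok, Record}); %% remember the most recent record
    eof ->
      Prev                                  %% nothing left: Prev was the last one
  end.
```

The record returned this way could then be passed to wsdk_record:is_valid/1, exactly as in the gem above.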



4. Smallest vs Biggest

In this gem, we’ll find the smallest and the biggest WARC records.
%% copy the following into a file called: "small_big.erl"
-module(small_big).
-export([find/1]).

%% useful field selectors
-include_lib("wsdk/include/wsdk.hrl").

%% macro which returns the record's length
-define(LENGTH(Record), wsdk_record:get(Record, ?'Content-Length')).

find(Filename) ->
  {ok, Record, Handle} = wsdk_warc:read(Filename),
  find(Handle, Record, Record).

find(Handle, Small, Big) ->
  SmallLength = ?LENGTH(Small),
  BigLength   = ?LENGTH(Big),
  case wsdk_warc:read(Handle) of
    {ok, Record, Handle1} ->
       Length = ?LENGTH(Record), %% current record size
       if
          Length > BigLength   -> find(Handle1, Small,  Record);
          Length < SmallLength -> find(Handle1, Record,    Big);
          true                 -> find(Handle1, Small,     Big)
       end;
    eof ->
       ok = wsdk_warc:close(Handle),
       [{smallest, SmallLength}, {biggest, BigLength}]
  end.
Compile:
c:\> erlc small_big.erl
Then run:
c:\> werl -s wsdk start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> small_big:find("WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz").

[{smallest,52},{biggest,222794980}]

2> q().
Answers: [{smallest, 52},{biggest, 222794980}]
Easy, isn’t it?
222794980 bytes, hmmm maybe a video?

Recap

The body of this new gem is very similar to the one in 1. Count WARC records.

  • Line 15: this time, besides the handle, the find/3 function takes two arguments of type Record: Small and Big.
    Both are initialized with the first record found in the WARC (line 13).
  • Line 22: if the current record’s size (Length) is greater than Big’s, the current record becomes the new Big and we recurse.
  • Line 23: if the current record’s size (Length) is less than Small’s, the current record becomes the new Small and we recurse.
  • Line 24: otherwise, keep both and move to the next record.
  • Line 28: return the result as a list of tagged tuples.
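
The comparison logic of lines 22 to 24 can be tried in isolation, without a WARC file. The following is a minimal sketch using only the standard library: the module min_max is hypothetical, and a plain list of integers stands in for the Content-Length of each record.

```erlang
%% copy the following into a file called: "min_max.erl"
%% (hypothetical module, not part of WSDK)
-module(min_max).
-export([find/1]).

%% Lengths is a non-empty list of integers (one per record).
find([First | Rest]) ->
  %% both bounds start with the first length, as in small_big:find/1
  {Small, Big} = lists:foldl(fun step/2, {First, First}, Rest),
  [{smallest, Small}, {biggest, Big}].

step(Length, {Small, Big}) when Length > Big   -> {Small, Length};
step(Length, {Small, Big}) when Length < Small -> {Length, Big};
step(_Length, Acc)                             -> Acc.
```

Using lists:foldl/3 instead of explicit recursion is only a design choice for this sketch; with a WARC handle there is no list to fold over, which is why the gem recurses by hand.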


Exercise 1

Also print the compressed start offset of each of them.
Hint: use the ‘Start-Offset’ selector (see wsdk.hrl)

Exercise 2

Handle the case of two or more records with the same size.
Hint: use Erlang Lists.

Exercise 3

The 52 bytes record corresponds to a useless DNS entry (“dns:git.anomos.info”).
How can we skip DNS records in the above program?
Hint: filter them out by checking the ?'WARC-Target-URI' field (see wsdk.hrl)
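
As a small illustration of the kind of check the hint points at, here is a sketch that recognizes “dns:” URIs with binary pattern matching. The module dns_filter is hypothetical, and it assumes the URI is available as a binary such as <<"dns:git.anomos.info">>; any other value, including a missing field, falls through to the catch-all clause.

```erlang
%% copy the following into a file called: "dns_filter.erl"
%% (hypothetical module, not part of WSDK)
-module(dns_filter).
-export([is_dns/1, keep/1]).

%% true if the URI uses the "dns:" scheme
is_dns(<<"dns:", _/binary>>) -> true;
is_dns(_)                    -> false.

%% drop every "dns:" URI from a list
keep(Uris) ->
  [Uri || Uri <- Uris, not is_dns(Uri)].
```

Inside find/3, the same test could guard a clause that simply recurses with Small and Big unchanged.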

5. A record somewhere

In this gem, we’ll dump some information about a WARC record at a specific offset.
Let’s assume we know this offset beforehand: 5970260
%% copy the following into a file called: "record_info.erl"
-module(record_info).
-export([print/2]).

%% useful field selectors
-include_lib("wsdk/include/wsdk.hrl").

%% macro for field selection
-define(FIELD(Record, Selector), wsdk_record:get(Record, Selector)).

print(Filename, Offset) ->
  {ok, Record, Handle} = wsdk_warc:read(Filename, Offset), %% open the WARC and move to an offset
  Info = [
          {type, ?FIELD(Record,       ?'WARC-Type')},
          {date, ?FIELD(Record,       ?'WARC-Date')},
          {vsn,  ?FIELD(Record,    ?'WARC-Version')},
          {id,   ?FIELD(Record,  ?'WARC-Record-ID')},
          {uri,  ?FIELD(Record, ?'WARC-Target-URI')}
         ],
  ok = wsdk_warc:close(Handle),
  Info.
Compile:
$ erlc record_info.erl
Then run:
$ erl -s wsdk start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> record_info:print("WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz", 5970260).

[{type,<<"response">>},
 {date,{{2011,2,25},{18,33,19}}},
 {vsn,<<"1.0">>},
 {id,<<"<urn:uuid:7c8beabf-0cae-47bd-928a-0625fbe5a306>">>},
 {uri,<<"http://sotis-it.ru/index.php?option=com_fireboard&Itemid=79&func=view&catid=19&id=1708">>}]

2> q().

Recap

  • Line 12: this time, we open the WARC file and move to a specific offset immediately.
    The offset in this case is a compressed one, i.e. a byte position within the GZIP-compressed file. WSDK supports both compressed and uncompressed offsets.
  • Lines 14-18: we retrieve some information about the record at this offset.
  • Lines 20,21: the handle is closed and the result is returned.



Exercise

Try to print more information about this record.
Hint: use the selectors in wsdk.hrl



6. Extracting payload

This gem will help us extract the payload of a particular WARC record at offset: 4781045
%% copy the following into a file called: "record_payload.erl"
-module(record_payload).
-export([dump/2]).

dump(Filename, Offset) ->
  {ok, Record, Handle} = wsdk_warc:read(Filename, Offset), %% open the WARC and move to an offset
  dump(Record),
  ok = wsdk_warc:close(Handle).

dump(Record) ->
  case wsdk_record:payload(Record) of
   {ok, Chunk, Record1} -> %% got a chunk, print it out
      io:format("~p", [Chunk]),
      dump(Record1);
   eof -> %% no more chunks to read from the payload
      ok
  end.
Compile:
$ erlc record_payload.erl
Then run:
$ erl -s wsdk start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> record_payload:dump("WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz", 4781045).

<<"GET /athletics/events/volleyball-hollins-trimatch HTTP/1.0\r\nUser-Agent: Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)\r\nConnection: close\r\nReferer: http://www.salem.edu/sitemap\r\nHost: www.salem.edu\r\n\r\n">>

2> q().

So, this is an HTTP GET request.

Recap

  • Line 11: the call to the function wsdk_record:payload/1 returns the record’s payload chunk by chunk.
  • Line 12: we get one chunk at a time, which is the safest way to deal with big payloads without holding them entirely in RAM.
  • Line 14: try to get the next payload chunk if any.
  • Line 15: the call returns the tag eof, we’re done.



Exercise

Instead of dealing with the payload in RAM, dump it to disk.
Hint: create a unique temporary file with wsdk_file:mktemp/0 and write to it using the file:write/2 function.
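
The writing half of this exercise can be sketched with the standard file module alone. The module chunks_to_disk and the injected Next fun are hypothetical; Next is assumed to return {ok, Chunk, NewState} or eof, the same shape as wsdk_record:payload/1 above. The return value of wsdk_file:mktemp/0 is not shown in this tutorial, so the sketch takes an ordinary path instead.

```erlang
%% copy the following into a file called: "chunks_to_disk.erl"
%% (hypothetical module, not part of WSDK)
-module(chunks_to_disk).
-export([dump/3]).

%% Write every chunk produced by Next to the file at Path.
dump(Next, State, Path) ->
  {ok, Fd} = file:open(Path, [write, raw, binary]),
  try
    write_all(Next, State, Fd)
  after
    ok = file:close(Fd)          %% always close, even if a write fails
  end.

write_all(Next, State, Fd) ->
  case Next(State) of
    {ok, Chunk, State1} ->
      ok = file:write(Fd, Chunk), %% only one chunk is held in memory at a time
      write_all(Next, State1, Fd);
    eof ->
      ok
  end.
```

With WSDK loaded, a call such as chunks_to_disk:dump(fun wsdk_record:payload/1, Record, "payload.bin") should write the payload of Record to disk, assuming payload/1 behaves as in the gem above.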



7. Create my own WARCs

In this gem, we’ll create two WARC records from scratch and put them into a new plain WARC file on disk.
%% copy the following into a UTF-8 encoded file called: "new_warc.erl"
-module(new_warc).
-export([create/1]).

%% useful field selectors
-include_lib("wsdk/include/wsdk.hrl").

create(Filename) ->
   %% Chinese: wsdk_utf8:to_binary([35486,25991,25945,23416,12539,35821,25991,25945,23398])
   {ok, UTF8Filename} = wsdk_utf8:to_binary("語文教學・语文教学"),

   Record1 = wsdk_record:new([ %% WARC Info record
           {?'WARC-Type',      'warcinfo'},
           {?'WARC-Version',   '1.0'},
           {?'WARC-Date',      {{2012,8,17},{22,57,14}}},
           {?'WARC-Filename',  UTF8Filename},
           {?'WARC-Record-ID', <<"urn:uuid:35f02b38-eb19-4f0d-86e4-bfe95815069c">>}
          ]),

   Record2 = wsdk_record:new([ %% WARC Response record
           {?'WARC-Version',   '0.17'}, %% notice the version number 0.17
           {?'Content-Type',   <<"application/http; msgtype=response">>},
           {?'WARC-Date',      {{2012,8,18},{1,25,33}}},
           {?'WARC-Record-ID', wsdk_uuid:urn()}, %% generate a URN
           {?'WARC-Type',      'response'},
           {?'Payload-Type',   'bytes'},
           {?'Payload-Source', <<"this is a nice payload">>}
          ]),

   {ok, WARC}  = wsdk_warc:write(Filename),       %% create a new plain WARC file

   {ok, WARC1} = wsdk_warc:write(WARC,  Record1), %% write Record1
   {ok, WARC2} = wsdk_warc:write(WARC1, Record2), %% write Record2

   ok = wsdk_warc:close(WARC2).
Compile:
c:\> erlc new_warc.erl
Then run:
c:\> werl -s wsdk start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> new_warc:create("foo.warc").

2> q().

Have a look at the newly generated WARC file foo.warc in the current directory:

WARC/1.0
Content-Length: 0
WARC-Type: warcinfo
WARC-Date: 2012-08-17T22:57:14Z
WARC-Record-ID: urn:uuid:35f02b38-eb19-4f0d-86e4-bfe95815069c
WARC-Filename: 語文教學・语文教学



WARC/0.17
Content-Length: 22
WARC-Type: response
WARC-Date: 2012-08-18T01:25:33Z
WARC-Record-ID: urn:uuid:f8c7505b-95f2-11e1-7163-3ae800000024
Content-Type: application/http; msgtype=response

this is a nice payload

Recap

  • Line 10: the UTF-8 binary string 語文教學・语文教学 is created to set the ‘WARC-Filename’ field (line 16).
    This call to wsdk_utf8:to_binary/1 ensures UTF-8 validity.
  • Line 12: the first record Record1, of type warcinfo and version 1.0, is created by calling wsdk_record:new/1.
    This record has no payload (‘Content-Length’ is set to 0 automatically).
  • Line 20: a second record Record2 of type response and version 0.17 is created.
    This time, the WARC-Record-ID is automatically generated with the call to wsdk_uuid:urn/0 (line 24).
    Moreover, a payload is set to: this is a nice payload.
    Finally, WSDK automatically handles the payload length for you (Content-Length: 22).
  • Line 30: an empty non-compressed WARC file is created.
  • Lines 32,33: the two previous records are written uncompressed.


Exercise 1

Let’s say we made a mistake and want to change Record2’s version to 1.0.
Hint: use the “set” call (see wsdk_record:set/3).

Exercise 2

Instead of a hard-coded binary payload, attach one from a file to Record2.
Hint: use payload of type ‘file’ (see wsdk.hrl).



8. MIME-Type statistics

This one is pretty useful. We’ll iterate over all the WARC records and classify them by MIME-Type.
To keep the code small and easy to follow, only 5 MIME categories are considered here:
  1. text
  2. image
  3. audio
  4. video
  5. other
%% copy the following into a file called: "record_mime.erl"
-module(record_mime).
-export([stats/1]).

%% useful field selectors
-include_lib("wsdk/include/wsdk.hrl").

stats(Filename) ->
  {ok, Record, Handle} = wsdk_warc:read(Filename),
  Mime = wsdk_record:get(Record, ?'Content-Type'),
  Categories = stats(Handle, match(Mime, {0,0,0,0,0})),
  ok = wsdk_warc:close(Handle),
  Categories.

stats(Handle, Categories) ->
  case wsdk_warc:read(Handle) of %% read next WARC record if any!
    {ok, Record, Handle1} ->
      Mime = wsdk_record:get(Record, ?'Content-Type'),
      stats(Handle1, match(Mime, Categories));
    eof ->
      Categories %% no more records, return categories!
  end.

%% refine (add more) match patterns below to be more precise
match(<<"text/", _/binary>>,  {Text, Image, Audio, Video, Other}) ->
  {Text + 1, Image, Audio, Video, Other};
match(<<"image/", _/binary>>, {Text, Image, Audio, Video, Other}) ->
  {Text, Image + 1, Audio, Video, Other};
match(<<"audio/", _/binary>>, {Text, Image, Audio, Video, Other}) ->
  {Text, Image, Audio + 1, Video, Other};
match(<<"video/", _/binary>>, {Text, Image, Audio, Video, Other}) ->
  {Text, Image, Audio, Video + 1, Other};
match(_, {Text, Image, Audio, Video, Other}) ->
  {Text, Image, Audio, Video, Other + 1}.
Compile:
$ erlc record_mime.erl
Then run:
$ erl -s wsdk start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> record_mime:stats("WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz").

{26,0,0,0,64162}

2> q().
So, 26 records are classified as Text and 64162 as Other.
You can get better results by adding more specific patterns.

Recap

  • Line 11: set the Categories to {0,0,0,0,0} as initial value.

  • Line 25: enter Erlang’s pattern matching, a powerful way to match on parts of a binary, in this case the prefix of the MIME-Type.
  • Line 25: if the MIME-Type starts with <<"text/">>, increment the Text category.

  • Line 27: if the MIME-Type starts with <<"image/">>, increment the Image category.

  • ...

  • Line 33: this one is a catch-all; we increment the Other category.



Exercise 1

PDF files are important and we want to count them separately.
Hint: create a 6th category with MIME application/pdf

Exercise 2

We want to consider all MIME-Types and not only the 5 categories above.
Hint: use the Erlang dict module (i.e. a hash table).
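
To show what the dict hint can look like, here is a sketch that tallies arbitrary MIME-Types using only the standard library. The module mime_tally and its helper media_type/1 are hypothetical; the input is assumed to be a list of Content-Type binaries, and parameters such as “; msgtype=response” are cut off so that variants of the same type are counted together.

```erlang
%% copy the following into a file called: "mime_tally.erl"
%% (hypothetical module, not part of WSDK)
-module(mime_tally).
-export([tally/1]).

%% Count how many times each media type occurs in a list of binaries.
tally(Mimes) ->
  Count = fun(Mime, Acc) -> dict:update_counter(media_type(Mime), 1, Acc) end,
  Dict  = lists:foldl(Count, dict:new(), Mimes),
  lists:sort(dict:to_list(Dict)). %% sorted list of {MediaType, Count}

%% keep only the part before the first ";"
media_type(Mime) ->
  [Type | _] = binary:split(Mime, <<";">>),
  Type.
```

In record_mime, the same dict could replace the five-element tuple as the accumulator passed through stats/2.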