WSE Tutorial

To test WSE, we need at least one WARC file to play with.
Fortunately, the Internet Archive offers a bunch of them, free of charge, at:

Free of charge WARC files
Just grab one:
$ wget http://archive.org/download/testWARCfiles/WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz
c:\> wget.exe http://archive.org/download/testWARCfiles/WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz


1. Index One WARC File

We assume everything is set up as described in the Requirements section.
So, two Elastic Search instances are up and running, and together form our cluster.
In a terminal, start an Erlang node to index the WARC file downloaded above:
c:\> werl

%% start the WSE server
1> ok = wse:start().

%% index the WARC file using the remote "Elastic Search" API
2> ok = wse:index("WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz").
This call to wse:index/1 is synchronous: it blocks until the indexing process terminates.
When it is done, an empty file with the extension .done is created next to the original WARC.
In our example, you’ll get:
WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz.done



2. Index Many Many WARC Files

The following small module indexes all the WARC files inside a directory.
%% copy the following into a file called: "wse_dir.erl"
-module(wse_dir).
-export([index/1]).

index(Dirname) ->
  %% match all WARC files (plain + GZIP) in directory Dirname
  WarcList = filename:join(Dirname, "*.warc*"),

  %% index all the WARCs one by one
  lists:foreach(fun(WARC) ->
                        ok = wse:index(WARC)
                end, filelib:wildcard(WarcList)).
Compile:
c:\> erlc wse_dir.erl
Then run:
c:\> werl -s wse start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

%% index all WARC files in the current directory (can take a while)
1> wse_dir:index("./").

%% quit the Erlang VM
2> q().

That’s it. Indexing WARC files for full-text search is as easy as that!

Recap

  • Lines 2,3: declare a new module called wse_dir which exports one public function, index/1.

  • The function index takes a single argument, the name of the directory holding the WARCs.

  • Line 7: build the wildcard pattern *.warc* inside Dirname (a shell-style wildcard, not a regular expression).
    It matches both plain WARCs (.warc) and compressed ones (.warc.gz).
  • Lines 10-12: expand the pattern with filelib:wildcard/1, then iterate through the WARC files one by one and index them synchronously.
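
One detail worth noting: the pattern *.warc* also matches the .done marker files described earlier (for example x.warc.gz.done), so a second run over the same directory would hand those markers to wse:index/1 as well. The source does not say whether WSE skips them itself. The sketch below assumes it does not, and filters the list before indexing. The module name wse_pending and its functions are hypothetical helpers, not part of WSE.

```erlang
%% Sketch only: wse_pending is a hypothetical helper module, not part of WSE.
-module(wse_pending).
-export([list/1, is_marker/1]).

%% True for the empty ".done" marker files left next to an indexed WARC.
is_marker(Path) ->
    filename:extension(Path) =:= ".done".

%% Return the WARC files in Dirname that are not markers themselves
%% and that have no ".done" marker next to them yet.
list(Dirname) ->
    Candidates = filelib:wildcard(filename:join(Dirname, "*.warc*")),
    [F || F <- Candidates,
          not is_marker(F),
          not filelib:is_regular(F ++ ".done")].
```

With this helper, the loop in wse_dir could iterate over wse_pending:list(Dirname) instead of the raw wildcard result, which also makes an interrupted run resumable.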



How to index WARC files at full speed?

Use wse:async_index/1 instead of wse:index/1. But before submitting each WARC,
check that the connection pool has enough room by calling wse:status/0.
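
The check-then-submit loop could be sketched as follows. The return values of wse:status/0 and wse:async_index/1 are not described here, so this sketch makes no assumption about them: the caller supplies a HasRoom fun that wraps wse:status/0 and returns true or false, and a Submit fun that starts the indexing. The module name wse_throttle and the 200 ms polling delay are arbitrary choices for illustration.

```erlang
%% Sketch only: wse_throttle is a hypothetical helper module, not part of WSE.
-module(wse_throttle).
-export([each/3]).

%% Delay between two checks of the pool (arbitrary value).
-define(POLL_MS, 200).

%% For each file, wait until HasRoom() returns true, then call Submit(File).
%% HasRoom :: fun(() -> boolean()), e.g. a wrapper around wse:status/0.
%% Submit  :: fun((File) -> term()), e.g. fun wse:async_index/1.
each(Files, HasRoom, Submit) ->
    lists:foreach(fun(File) ->
                          ok = wait_for_room(HasRoom),
                          Submit(File)
                  end, Files).

%% Poll HasRoom until it reports free room in the pool.
wait_for_room(HasRoom) ->
    case HasRoom() of
        true  -> ok;
        false -> timer:sleep(?POLL_MS),
                 wait_for_room(HasRoom)
    end.
```

A call would then look like wse_throttle:each(Files, HasRoom, fun wse:async_index/1), where HasRoom is whatever test on the result of wse:status/0 fits your setup.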


3. Search Your WARC Files

At this point, all WARCs were successfully indexed in Elastic Search.
Now, use your favourite programming language to search them.
Here are some examples with cURL:
$ curl -XGET 'http://127.0.0.1:9200/wse_sample/_search?q=Webdiyer'

$ curl -XGET 'http://127.0.0.1:9200/wse_sample/_search?q=warcs.uri:www.theliberatorfiles.com&df=data.content'

$ curl -XGET 'http://127.0.0.1:9200/wse_sample/_search?q=Vitesse+du+vent&from=0&size=1'

$ curl -XGET 'http://127.0.0.1:9200/wse_sample/_search?q=web*&analyze_wildcard=true&df=warcs.data'

$ curl 'http://127.0.0.1:9200/wse_sample/_search?q=web*&analyze_wildcard=true&df=warcs.data&fields=uri,uuid,warc_name'
Querying Elastic Search with cURL
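
The same queries can be sent from Erlang itself. Below is a minimal sketch using the httpc client from the standard inets application and uri_string:compose_query/1 to encode the parameters; it assumes Erlang/OTP 21 or later, which is newer than the R15B01 release shown in the shell banner above. The module name wse_search is hypothetical, and the host, port and index name are simply those used in the cURL examples.

```erlang
%% Sketch only: wse_search is a hypothetical helper module, not part of WSE.
%% Requires Erlang/OTP 21+ (uri_string).
-module(wse_search).
-export([url/2, query/2]).

%% Elastic Search endpoint, as used in the cURL examples.
-define(BASE, "http://127.0.0.1:9200/").

%% Build the search URL; Params is a list of {Key, Value} strings,
%% e.g. [{"q", "Vitesse du vent"}, {"size", "1"}].
url(Index, Params) ->
    ?BASE ++ Index ++ "/_search?" ++ uri_string:compose_query(Params).

%% Run the search and return the raw JSON body on HTTP 200.
query(Index, Params) ->
    {ok, _} = application:ensure_all_started(inets),
    case httpc:request(url(Index, Params)) of
        {ok, {{_Vsn, 200, _Reason}, _Headers, Body}} -> {ok, Body};
        {ok, {{_Vsn, Code, _Reason}, _Headers, Body}} -> {error, {Code, Body}};
        {error, Reason} -> {error, Reason}
    end.
```

For example, wse_search:query("wse_sample", [{"q", "Webdiyer"}]) corresponds to the first cURL call above. The body is returned undecoded; parsing the JSON is left to whichever library you prefer.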


Elastic Search also offers a WebUI via the head plugin (see Requirements).
Elastic Search WebUI with Head plugin