$ wget http://archive.org/download/testWARCfiles/WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz
c:\> wget.exe http://archive.org/download/testWARCfiles/WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz
c:\> werl
%% start the WSE server
1> ok = wse:start().
%% index the WARC file using the remote "Elastic Search" API
2> ok = wse:index("WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz").
WIDE-20110225183219005-04371-13730~crawl301.us.archive.org~9443.warc.gz.done
%% copy the following into a file called: "wse_dir.erl"
-module(wse_dir).
-export([index/1]).

index(Dirname) ->
    %% match all WARC files (plain + GZIP) in directory Dirname
    WarcList = filename:join(Dirname, "*.warc*"),
    %% index all the WARCs one by one
    lists:foreach(fun(WARC) ->
                      ok = wse:index(WARC)
                  end, filelib:wildcard(WarcList)).
c:\> erlc wse_dir.erl
c:\> werl -s wse start
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1 (abort with ^G)
%% index all WARC files in the current directory (can take a while)
1> wse_dir:index("./").
%% quit the Erlang VM
2> q().
That’s it. Indexing WARC files for fulltext search is as easy as that!
The -module and -export attributes declare a new module called wse_dir, which exports one public function, index/1.
The index function takes a single argument: the name of the directory that contains the WARC files.
The WarcList line builds the wildcard pattern *.warc* inside Dirname. This is a shell-style wildcard, not a regular expression; once expanded by filelib:wildcard/1, it matches both plain WARCs (.warc) and compressed ones (.warc.gz).
The final lists:foreach call iterates through the matched WARC files one by one and indexes them synchronously.
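To see what a pattern like *.warc* actually picks up before indexing a directory, a small shell helper can list the candidates. This is only a sketch: it uses shell globbing as a stand-in for filelib:wildcard/1, and the function name list_warc_candidates is made up for illustration. Note that the trailing * also matches names such as the .warc.gz.done file shown earlier, assuming that file sits in the same directory as the WARC.

```shell
# Hypothetical helper: print the names in a directory that match *.warc*.
# Shell globbing is used here as an approximation of filelib:wildcard/1.
list_warc_candidates() {
    dir="${1:-.}"
    for f in "$dir"/*.warc*; do
        # When nothing matches, the shell leaves the pattern unexpanded; skip it.
        [ -e "$f" ] || continue
        # Print the file name without its directory part.
        printf '%s\n' "${f##*/}"
    done
}

# Usage: list_warc_candidates ./
```

If the listing contains anything that is not a WARC file, it may be worth moving those files aside or narrowing the pattern before calling wse_dir:index/1.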
How to index WARC files at full speed?
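# search for the term "Webdiyer"
$ curl -XGET 'http://127.0.0.1:9200/wse_sample/_search?q=Webdiyer'
# match on the warcs.uri field, with data.content as the default field (df)
$ curl -XGET 'http://127.0.0.1:9200/wse_sample/_search?q=warcs.uri:www.theliberatorfiles.com&df=data.content'
# search for "Vitesse du vent" and return only the first hit (from=0, size=1)
$ curl -XGET 'http://127.0.0.1:9200/wse_sample/_search?q=Vitesse+du+vent&from=0&size=1'
# wildcard search on the warcs.data field
$ curl -XGET 'http://127.0.0.1:9200/wse_sample/_search?q=web*&analyze_wildcard=true&df=warcs.data'
# same wildcard search, returning only the uri, uuid and warc_name fields
$ curl 'http://127.0.0.1:9200/wse_sample/_search?q=web*&analyze_wildcard=true&df=warcs.data&fields=uri,uuid,warc_name'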
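The from and size parameters used above page through the results. As a minimal sketch, a helper like the one below builds such search URLs. The name wse_search_url is hypothetical, the host and index are the ones from the examples above, and the query is assumed to be URL-encoded already (for example, spaces written as +).

```shell
# Hypothetical helper: build a search URL for the wse_sample index.
#   $1 = query string, already URL-encoded
#   $2 = offset of the first hit (defaults to 0)
#   $3 = number of hits to return (defaults to 10)
wse_search_url() {
    printf 'http://127.0.0.1:9200/wse_sample/_search?q=%s&from=%s&size=%s\n' \
        "$1" "${2:-0}" "${3:-10}"
}

# Usage (needs a running server): curl -XGET "$(wse_search_url 'Vitesse+du+vent' 0 1)"
```

Keeping the URL construction in one place makes it easy to step through pages by increasing the offset in multiples of the page size.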