Overview
The WARC Search Engine (WSE for short) is a scalable Erlang server that lets you index all your WARC files in a distributed manner.
WSE uses Elasticsearch as its default backend to provide linear scalability.
Whether you have ten, a thousand, or a million WARC files, WSE will let you index them all in parallel on multiple nodes.
For maximum performance, indexing runs in parallel not only across WARC files but also within each WARC file, where multiple WARC records are indexed at a time.
Moreover, WSE supports both plain and compressed WARC files, thanks to WSDK.
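
The two levels of parallelism described above (across WARC files and across the records inside one file) can be illustrated with a minimal Python sketch. This is not WSE's Erlang implementation or its API: `index_record`, `index_warc`, and `index_all` are hypothetical names, records are plain dictionaries, and the "indexing" step is a stand-in that only measures the payload.

```python
from concurrent.futures import ThreadPoolExecutor


def index_record(record):
    # Hypothetical stand-in for sending one WARC record to the search backend.
    return {"id": record["id"], "length": len(record["payload"])}


def index_warc(records, record_workers=4):
    # Inner level: index the records of a single WARC file concurrently.
    with ThreadPoolExecutor(max_workers=record_workers) as pool:
        return list(pool.map(index_record, records))


def index_all(warcs, file_workers=2):
    # Outer level: process several WARC files concurrently.
    # `warcs` maps a file name to its (already parsed) list of records.
    with ThreadPoolExecutor(max_workers=file_workers) as pool:
        results = pool.map(index_warc, warcs.values())
        return dict(zip(warcs.keys(), results))
```

Calling `index_all({"a.warc": [...], "b.warc.gz": [...]})` returns one list of results per file, in the original record order. In a real deployment the outer level would be spread over several nodes rather than threads in one process.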
Immediate benefits for your programs are:
- Linear scalability
- Search capabilities based on Apache Lucene
- Built-in backpressure support on connections that are indexing too fast
- A simple, intuitive API (two function calls)
- Can be debugged and fixed while running (no downtime)
- All features available through a RESTful JSON API
- No Single Point of Failure (SPOF)
- Apache Tika for detecting and extracting metadata and structured text content from HTML, PDF, Word, PPT, etc.
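
As an illustration of the backpressure idea in the list above, here is a minimal Python sketch based on a bounded queue. It only shows the general technique, not how WSE implements it: `try_submit` and `drain` are hypothetical helpers, and the queue size is an arbitrary assumption.

```python
import queue


def try_submit(pending, document):
    # Accept the document if the bounded queue has room; otherwise return
    # False so the caller can tell the client to slow down and retry.
    try:
        pending.put_nowait(document)
        return True
    except queue.Full:
        return False


def drain(pending, index_fn):
    # Index everything currently queued and return how many documents
    # were processed. `index_fn` is a hypothetical indexing callback.
    count = 0
    while True:
        try:
            document = pending.get_nowait()
        except queue.Empty:
            return count
        index_fn(document)
        count += 1
```

With `pending = queue.Queue(maxsize=2)`, a third submission is refused until `drain` has freed space, which is what keeps a fast client from overwhelming the indexer.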
WSE comes with a simple API, examples, and support.
WSE is standalone (32/64-bit), multicore-aware, portable (Linux, Windows, OS X, etc.), and requires no external dependencies.