1. Overview

Web archive quality is a broad subject and probably one of the biggest challenges the field has encountered so far.
Although capture quality depends on the tools and techniques used, link extraction is the process
that affects it the most.

Link discovery in web archiving consists of extracting the links and resources of a given website so that
they can be fed to the crawler, which then captures them. This makes it the most crucial step in the
capture sequence, and also the most decisive one, since most of the quality of the result depends on it.
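As an illustration of what link discovery involves, here is a minimal sketch using only Python's standard library. It is not UXTR's implementation: the `LinkCollector` class, the `extract_links` function, and the set of tag/attribute pairs in `LINK_ATTRS` are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag/attribute pairs treated as link sources in this sketch.
# This set is an assumption for illustration, not UXTR's actual rule set.
LINK_ATTRS = {
    ("a", "href"),
    ("link", "href"),
    ("img", "src"),
    ("script", "src"),
    ("iframe", "src"),
}


class LinkCollector(HTMLParser):
    """Collects absolute URLs found in the attributes listed in LINK_ATTRS."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in LINK_ATTRS:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the URLs discovered in `html`, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The returned list is what would be handed to a crawler for capture. A real extractor would also need to look inside CSS, JavaScript, and other embedded content, which this sketch ignores.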

The UXTR project consists of a RESTful web service and API that discovers and extracts links, thereby improving
the quality of capture by feeding more URLs to the crawler.

UXTR has been tested and is known to run on Windows, Linux, and Mac OS X, in both 32-bit and 64-bit versions.

It is currently embedded in KEN, a cross-platform web crawler.

Immediate benefits are:

Cross-platform: operates unchanged on Windows, Mac OS X, and Linux (32/64-bit)
Proxy Passthrough: supports any HTTP proxy, such as Squid, TinyProxy, Privoxy, etc.
Links Categorization: in, out, and embeds
Links Deduplication: on the fly (via Bloom filters)
Parallel Processing: parses multiple pages simultaneously
Internal Cache Awareness: freshness, validation, and invalidation
Customizable Scripting System: can be scripted to optimize link extraction for specific pages, platforms (Wikis, Twitter, ...), and/or technologies (ASP, JSP, ...)
TCP & RESTful APIs: stream API support
Crash Resilient: heartbeat protocol for monitoring
Standalone: requires no dependencies
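To make the categorization and deduplication items more concrete, the following sketch shows one way they could work. It is illustrative only and not UXTR's code: the `BloomFilter` class, its default sizes, the `categorize` and `dedupe` helpers, and the reading of "in" as same-host and "out" as other-host links are all assumptions.

```python
import hashlib
from urllib.parse import urlsplit


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive one bit position per hash function from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def categorize(url, page_url, is_embed=False):
    """Label a link as 'embed', 'in' (same host), or 'out' (other host).

    The same-host rule is an assumed interpretation of in/out.
    """
    if is_embed:
        return "embed"
    same_host = urlsplit(url).hostname == urlsplit(page_url).hostname
    return "in" if same_host else "out"


def dedupe(urls, seen):
    """Return the URLs not yet recorded in `seen`, recording them as we go."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A Bloom filter trades a small false-positive rate for constant memory, so in a design like this one a URL may occasionally be skipped as a duplicate when it has not been seen, but a URL that has been seen is never reported as new.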

