1. Overview

Web archive quality is a broad subject and probably one of the biggest challenges encountered so far.
Although the quality of a capture depends on the tools and techniques used, link extraction is the process
that affects it the most.
Link discovery in web archiving consists of extracting the links and resources of a given website so that
they can be fed to the crawler, which then captures them. This makes it the most crucial step in the
capture sequence, and the most decisive one, since most of the result's quality rests on it.
The UXTR project consists of a RESTful web service and API for discovering and extracting links, thereby
improving the quality of capture by feeding more URLs to the crawler.
UXTR has been tested and is known to run on Windows, Linux, and Mac OS X, both 32- and 64-bit.
It is embedded in KEN, a cross-platform web crawler.
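To give a sense of how a RESTful extraction service like this is typically called, here is a minimal Python sketch. The host, port, endpoint path, and JSON field names are assumptions made for illustration; they are not taken from UXTR's actual API.

    # Minimal sketch of querying a link-extraction web service over HTTP.
    # The host, port, endpoint, and field names are illustrative assumptions,
    # not UXTR's documented interface.
    import json
    import urllib.request

    def extract_links(service_url, page_url):
        """Ask the extraction service for the links discovered on page_url."""
        payload = json.dumps({"url": page_url}).encode("utf-8")
        request = urllib.request.Request(
            service_url,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    if __name__ == "__main__":
        # Hypothetical local endpoint; replace with the real service address.
        print(extract_links("http://localhost:8080/extract", "http://example.org/"))

The discovered URLs returned by such a call would then be handed to the crawler (KEN, in UXTR's case) to seed further captures.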
Immediate benefits are:
  • Cross-platform: operates unchanged on Windows, Mac OS X, and Linux, 32/64-bit
  • Proxy Passthrough: supports any HTTP proxy, such as Squid, TinyProxy, Privoxy, etc.
  • Link Categorization: in, out, and embeds
  • Link Deduplication: on the fly, via Bloom filters (see the sketch after this list)
  • Parallel Processing: parses multiple pages simultaneously
  • Internal Cache Awareness: freshness, validation, and invalidation
  • Customizable Scripting System: can be scripted to optimize
    link extraction for specific pages, platforms (Wikis, Twitter, ...)
    and/or technologies (ASP, JSP, ...)
  • TCP & RESTful APIs: stream API support
  • Crash Resilient: heartbeat protocol for monitoring
  • Standalone: requires no dependencies
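
As a rough illustration of two items above, link categorization (in, out, and embeds) and on-the-fly deduplication with a Bloom filter, here is a minimal Python sketch. It is not UXTR's implementation: the filter size, hash count, and suffix list are arbitrary assumptions, and "embeds" is taken to mean embedded page resources such as stylesheets and images.

    # Sketch of categorizing discovered links as "in" (same host), "out"
    # (other hosts), or "embed" (page resources), and deduplicating them
    # on the fly with a Bloom filter. Names and constants are illustrative.
    import hashlib
    from urllib.parse import urlparse

    class BloomFilter:
        """Tiny Bloom filter: false positives possible, false negatives not."""
        def __init__(self, size_bits=1 << 20, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # Derive several bit positions from salted SHA-256 digests.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # Arbitrary suffix list used to spot embedded resources (an assumption).
    EMBED_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

    def categorize(base_url, link):
        """Return 'embed', 'in', or 'out' for a discovered link."""
        if urlparse(link).path.lower().endswith(EMBED_SUFFIXES):
            return "embed"
        same_host = urlparse(link).netloc == urlparse(base_url).netloc
        return "in" if same_host else "out"

    seen = BloomFilter()
    for link in ["http://example.org/a", "http://other.net/b",
                 "http://example.org/style.css", "http://example.org/a"]:
        if link in seen:
            continue  # probable duplicate: skip it without re-parsing
        seen.add(link)
        print(categorize("http://example.org/", link), link)

A Bloom filter suits this job because membership tests run in constant time with fixed memory, at the cost of occasional false positives (a new URL mistakenly treated as seen), which is usually an acceptable trade-off for large crawls.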

[Erlang logo]