1. Overview
Quality is a broad subject in web archiving and probably one of the biggest challenges encountered so far.
Although capture quality depends on the tools and techniques used, link extraction is the process
that affects it the most.
Link discovery in web archiving consists of extracting the links and resources of a given website so that
they can be fed to the crawler, which then captures them. This makes it the most crucial step in the
capture sequence, since most of the quality of the result depends on it.
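
To illustrate the link discovery step described above, here is a minimal sketch in Python. It is not UXTR's
implementation: the names LinkCollector, extract_links, and URL_ATTRS are hypothetical, and the sketch only
inspects href and src attributes in static HTML using the standard library. A real extractor would likely
cover more sources (CSS, scripts, other attributes).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Attributes assumed to carry URLs (illustrative, not exhaustive).
URL_ATTRS = {"href", "src"}


class LinkCollector(HTMLParser):
    """Hypothetical helper that collects absolute http(s) URLs from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative URLs against the page URL and drop fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Keep only http(s) URLs, skipping duplicates.
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the list of URLs found in html, in order of first appearance."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The returned list is what would be handed to the crawler as candidate URLs to capture.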
The UXTR project consists of a RESTful Web Service and API that discovers and extracts links, thereby improving
the quality of capture by feeding more URLs to the crawler.
UXTR has been tested and is known to run on Windows, Linux, and Mac OS X ... 32/64-bit.
It is embedded in KEN, a cross-platform web crawler.