Today, we’ll talk about SURT.
Sort-friendly URI Reordering Transform (SURT), as described in the doc, converts URIs of the form scheme://userinfo@domain.tld:port/path?query#fragment into scheme://(tld,domain,:port@userinfo)/path?query#fragment
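To make the reordering concrete, here is a minimal Python sketch of the transform. It is only an illustration under stated assumptions, not the Java or Erlang implementation discussed in this post: the function name `surt_reorder` is hypothetical, no canonicalization is applied, and hosts given as IP addresses are not treated specially.

```python
from urllib.parse import urlsplit


def surt_reorder(uri: str) -> str:
    """Reorder the authority of a URI into the SURT form (illustrative only)."""
    parts = urlsplit(uri)
    host = parts.hostname or ""

    # Reverse the host labels: www.example.org -> org,example,www,
    authority = ",".join(reversed(host.split("."))) + "," if host else ""

    # The port (if any) follows the reversed host, prefixed with ':'.
    if parts.port is not None:
        authority += f":{parts.port}"

    # The userinfo (if any) comes last, prefixed with '@'.
    userinfo = parts.netloc.rpartition("@")[0]
    if userinfo:
        authority += f"@{userinfo}"

    result = f"{parts.scheme}://({authority}){parts.path}"
    if parts.query:
        result += "?" + parts.query
    if parts.fragment:
        result += "#" + parts.fragment
    return result


print(surt_reorder("http://user@www.example.org:8080/path?q=1#frag"))
# http://(org,example,www,:8080@user)/path?q=1#frag
```

Because the most significant part of the host now comes first, a plain lexicographic sort groups all URLs of a domain and its subdomains together.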
The idea was good (see above for more details), so we decided to implement the SURT + Canonicalization algorithms to support legacy WARC/CDX files in COBALT, Aleph’s upcoming playback tool (written in Erlang).
While doing so, we ran side-by-side comparison tests to make sure the results of the two implementations match. For these tests, we used the CDX files generously provided by the Internet Archive.
Please note that we are only interested in CDX files as they provide the SURT version of each archived URL.
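A comparison of this kind can be sketched as follows. This is a hedged illustration rather than the harness we actually used: the function name `find_mismatches` is hypothetical, and the field positions (key in the first column, original URL in the third) are assumptions based on the common `CDX N b a ...` layout, so adjust them to the header of your own files.

```python
from typing import Callable, Iterable, Iterator, Tuple


def find_mismatches(
    cdx_lines: Iterable[str],
    transform: Callable[[str], str],
    key_field: int = 0,   # assumed position of the SURT key
    url_field: int = 2,   # assumed position of the original URL
) -> Iterator[Tuple[str, str, str]]:
    """Yield (original_url, expected_key, actual_key) for every disagreement."""
    for line in cdx_lines:
        stripped = line.strip()
        # Skip blank lines and the format header (e.g. " CDX N b a ...").
        if not stripped or stripped.startswith("CDX"):
            continue
        fields = stripped.split()
        if len(fields) <= max(key_field, url_field):
            continue  # malformed or truncated line
        expected = fields[key_field]
        original = fields[url_field]
        actual = transform(original)
        if actual != expected:
            yield original, expected, actual
```

Passing the implementation under test as `transform` keeps the harness independent of any particular SURT or canonicalization code.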
We were really surprised to find a lot of gotchas; some of them are listed below:
|SURT + Canonicalization (truncated ///)|
|SURT + Canonicalization (magic &)|
|Original URL||URL http://www.viciouscycleworks.us/khxc/index.php?
|SURT + Canonicalization (unsorted arguments: “href” before “url”)|
|SURT + Canonicalization (missing /)|
We found thousands of inconsistencies like these.
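To illustrate the "unsorted arguments" case, here is a minimal Python sketch of the step one would expect from a canonicalizer, assuming query arguments are meant to be ordered alphabetically. The function name `sort_query_args` is hypothetical, and dropping empty segments is an assumption of this sketch, not a documented rule.

```python
def sort_query_args(query: str) -> str:
    """Return the query string with its '&'-separated arguments sorted."""
    # Assumption: empty segments (from '&&' or a trailing '&') are dropped.
    args = [arg for arg in query.split("&") if arg]
    return "&".join(sorted(args))


print(sort_query_args("url=a&href=b"))
# href=b&url=a
```

With this ordering, "href" always precedes "url", so two URLs that differ only in argument order map to the same key.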
Consequences: the SURT algorithm is central to the web-archiving replay infrastructure.
The Wayback Machine indexes billions of URLs using this algorithm in files known as “index files” (CDX) that can rapidly grow to millions of lines (terabytes in size).
The problem when a bug like the ones outlined in this post appears is that fixing the bug is not enough: thousands of index files must also be regenerated (a costly operation).
That leaves the users of this API with two options, a real dilemma:
1. fix the Java API and regenerate the CDX files (accepting the cost and time of the operation)
2. stick with the buggy API and its consequences: missing resources, poor detection of duplicate URLs, etc.
Lesson learned: test your code and algorithms carefully, especially when other high-impact programs will be built on top of them.
by Younès HAFRI