Where are my (web) archives? SURT SURT to the rescue

Today, we’ll talk about SURT.

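The Sort-friendly URI Reordering Transform (SURT), as described in the doc, converts URIs of the form scheme://userinfo@domain.tld:port/path?query#fragment into scheme://(tld,domain,:port@userinfo)/path?query#fragment. Putting the most significant part of the host name first means that sorting the transformed URIs groups entries from the same domain together.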
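
To make the reordering concrete, here is a minimal Python sketch of the transform exactly as stated above. It is not the Java or Erlang code discussed in this post: the function name to_surt is ours, IP-address hosts are not special-cased, and no canonicalization is applied. The CDX keys shown in the examples further down also drop the scheme, the opening parenthesis and the trailing comma, which this sketch does not do.

```python
from urllib.parse import urlsplit


def to_surt(uri: str) -> str:
    """Reorder a URI into the scheme://(tld,domain,:port@userinfo)/path form.

    Illustrative only: real implementations also handle IP addresses,
    malformed input and canonicalization rules not covered here.
    """
    parts = urlsplit(uri)
    host = parts.hostname or ""  # urlsplit lowercases the host name

    # Reverse the dot-separated labels and join them with commas,
    # keeping the trailing comma shown in the pattern above.
    authority = ",".join(reversed(host.split("."))) + ","

    if parts.port is not None:
        authority += f":{parts.port}"

    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo += ":" + parts.password
        authority += "@" + userinfo

    result = f"{parts.scheme}://({authority}){parts.path}"
    if parts.query:
        result += "?" + parts.query
    if parts.fragment:
        result += "#" + parts.fragment
    return result


print(to_surt("http://user@www.example.com:8080/a/b?x=1#frag"))
# http://(com,example,www,:8080@user)/a/b?x=1#frag
```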

The idea is a good one (see above for more details), so we decided to implement the SURT and canonicalization algorithms in COBALT, Aleph's upcoming playback tool (written in Erlang), in order to support legacy WARC/CDX files.

While doing so, we ran side-by-side comparisons to make sure the results of the two implementations matched. For these tests we used the CDX files generously provided by the Internet Archive.

Please note that we are only interested in CDX files as they provide the SURT version of each archived URL.
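
A comparison of this kind can be sketched in a few lines of Python. This is not the harness we actually ran. It assumes space-separated CDX lines in which the first field is the indexed SURT key and the third field is the original URL, plus an optional header line starting with "CDX". The names find_mismatches and canonicalize are hypothetical, and canonicalize stands for whichever implementation is being checked.

```python
from typing import Callable, Iterable, Iterator, Tuple


def find_mismatches(
    cdx_lines: Iterable[str],
    canonicalize: Callable[[str], str],
) -> Iterator[Tuple[str, str, str]]:
    """Yield (original_url, indexed_key, our_key) for every disagreement.

    Assumed layout (not taken from this post): fields are separated by
    single spaces, field 0 is the SURT key, field 2 is the original URL.
    """
    for line in cdx_lines:
        line = line.strip()
        # Skip blank lines and the format header (" CDX N b a ...").
        if not line or line.startswith("CDX"):
            continue
        fields = line.split(" ")
        if len(fields) < 3:
            continue  # too short to contain a key and an original URL
        indexed_key, original_url = fields[0], fields[2]
        our_key = canonicalize(original_url)
        if our_key != indexed_key:
            yield original_url, indexed_key, our_key
```

Because the function is a generator, it can stream through very large index files without loading them into memory.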

We were really surprised to find a lot of gotchas. Some of them are listed below:

SURT + Canonicalization (truncated ///)
Original URL http://api.buzzurl.jp/api/counter/http:///www.e-research.biz/statistics/sta_13/002740.html
Expected jp,buzzurl,api)/api/counter/http://www.e-research.biz/statistics/sta_13/002740.html
Wayback Machine jp,buzzurl,api)/api/counter/http:/www.e-research.biz/statistics/sta_13/002740.html
COBALT jp,buzzurl,api)/api/counter/http://www.e-research.biz/statistics/sta_13/002740.html

 

SURT + Canonicalization (magic &)
Original URL http://www.viciouscycleworks.us/khxc/index.php?app=ccp0&ns=catshow&ref=kawasaki&prodsort=PRICEUP&sid=ze2509b1v54ldiu17199jlcf8n88cquz
Expected us,viciouscycleworks)/khxc/index.php?app=ccp0&ns=catshow&prodsort=priceup&ref=kawasaki
Wayback Machine us,viciouscycleworks)/khxc/index.php?&app=ccp0&ns=catshow&prodsort=priceup&ref=kawasaki
COBALT us,viciouscycleworks)/khxc/index.php?app=ccp0&ns=catshow&prodsort=priceup&ref=kawasaki
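
The query handling implied by the "Expected" row above can be sketched as follows. The rules are inferred from this single example (lowercase everything, drop the session argument, drop empty arguments so that no stray & survives, then sort). The SESSION_PARAMS set and the function name are assumptions made for illustration, not the rule set used by the Wayback Machine or by COBALT.

```python
# Assumption: only "sid" is treated as a session identifier here.
SESSION_PARAMS = frozenset({"sid"})


def canonicalize_query(query: str) -> str:
    """Lowercase, strip session arguments, drop empties, sort the rest."""
    kept = []
    for arg in query.lower().split("&"):
        if not arg:
            continue  # an empty argument is what would leave a stray "&"
        name = arg.split("=", 1)[0]
        if name in SESSION_PARAMS:
            continue
        kept.append(arg)
    return "&".join(sorted(kept))


query = ("app=ccp0&ns=catshow&ref=kawasaki&prodsort=PRICEUP"
         "&sid=ze2509b1v54ldiu17199jlcf8n88cquz")
print(canonicalize_query(query))
# app=ccp0&ns=catshow&prodsort=priceup&ref=kawasaki
```

Filtering out empty arguments before joining is one simple way to avoid a leading & when a removed argument sat at the start or end of the query string.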

 

SURT + Canonicalization (unsorted arguments: “href” before “url”)
Original URL http://www.highschoolmusicalgames.us/frame/frame.php?url=www.simpy.com/simpy/LinkAdd.do?href=www.highschoolmusicalgames.us/play-flash-7593.html-Naruto%20Games%20Sakura%20dress
Expected us,highschoolmusicalgames)/frame/frame.php?href=www.highschoolmusicalgames.us/play-flash-7593.html-naruto%20games%20sakura%20dress?url=www.simpy.com/simpy/linkadd.do
Wayback Machine us,highschoolmusicalgames)/frame/frame.php?url=www.simpy.com/simpy/linkadd.do?href=www.highschoolmusicalgames.us/play-flash-7593.html-naruto%20games%20sakura%20dress
COBALT us,highschoolmusicalgames)/frame/frame.php?href=www.highschoolmusicalgames.us/play-flash-7593.html-naruto%20games%20sakura%20dress?url=www.simpy.com/simpy/linkadd.do

 

SURT + Canonicalization (missing /)
Original URL http://www.polish-online.com/wordpress/?p=187
Expected com,polish-online)/wordpress/?p=187
Wayback Machine com,polish-online)/wordpress?p=187
COBALT com,polish-online)/wordpress/?p=187

We found thousands of inconsistencies like these.

Consequences: the SURT algorithm is central to the web archiving replay infrastructure.
The Wayback Machine uses this algorithm to index billions of URLs in files known as “index files” (CDX), which can rapidly grow to millions of lines (terabytes in size).
When a bug like the ones outlined in this post appears, fixing the code is not enough: thousands of index files must also be regenerated, which is a costly operation.

That leaves the users of this API with a real dilemma. They have two options:
1. fix the Java API and regenerate the CDX files (accepting the cost and time of the operation)
2. stick with the buggy API and its consequences: missing resources, poor detection of duplicate URLs, etc.

Lesson learned: test your code and algorithms carefully, especially when other high-impact programs will be built on top of them.

by Younès HAFRI