Where are my (web) archives? SURT SURT to the rescue

Today, we’ll talk about SURT.

Sort-friendly URI Reordering Transform (SURT), as described in the doc, converts URIs of the form scheme://userinfo@domain.tld:port/path?query#fragment into scheme://(tld,domain,:port@userinfo)/path?query#fragment.
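
To make the reordering concrete, here is a minimal Python sketch of the transform exactly as stated above. It is an illustration only, not the Wayback Machine or COBALT code: the function name `surt` is ours, no canonicalization is applied, and the host is lowercased as a side effect of Python's `urlsplit`. Note that the CDX keys shown later in this post use a shorter form without the scheme, the opening parenthesis and the trailing comma.

```python
from urllib.parse import urlsplit


def surt(uri: str) -> str:
    """Reorder the authority part of `uri` into the SURT form (no canonicalization)."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    # Reverse the dot-separated host labels: www.example.com -> com,example,www,
    authority = ",".join(reversed(host.split("."))) + ","
    if parts.port is not None:
        authority += f":{parts.port}"
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo += ":" + parts.password
        authority += "@" + userinfo
    result = f"{parts.scheme}://({authority}){parts.path}"
    if parts.query:
        result += "?" + parts.query
    if parts.fragment:
        result += "#" + parts.fragment
    return result


print(surt("http://user@www.example.com:8080/a?b=c#d"))
```

Because the most significant host label comes first, sorting SURT keys groups all captures of a domain and its subdomains together, which is what makes the form "sort-friendly".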

The idea is sound (see above for more details), so we decided to implement the SURT and canonicalization algorithms to support legacy WARC/CDX files in COBALT, Aleph’s upcoming playback tool (written in Erlang).

While doing so, we ran side-by-side comparisons to make sure the results of the two implementations match. For this, we used the CDX files generously provided by the Internet Archive.

Please note that we are only interested in CDX files as they provide the SURT version of each archived URL.
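
A comparison of this kind can be sketched in a few lines of Python. This is not our actual test harness. It assumes the common space-separated CDX layout in which the first field is the stored SURT key and the third field is the original URL, and that the header line starts with "CDX". The names `find_mismatches` and `make_key` are hypothetical, and `make_key` stands for whichever SURT + canonicalization implementation is under test.

```python
from typing import Callable, Iterable, Iterator, Tuple


def find_mismatches(
    cdx_lines: Iterable[str],
    make_key: Callable[[str], str],
) -> Iterator[Tuple[str, str, str]]:
    """Yield (original_url, stored_key, computed_key) for each disagreement."""
    for line in cdx_lines:
        line = line.strip()
        # Skip blank lines and the format header (assumed to start with "CDX").
        if not line or line.startswith("CDX"):
            continue
        fields = line.split(" ")
        if len(fields) < 3:
            continue  # not enough fields to hold a key and an original URL
        stored_key, original_url = fields[0], fields[2]
        computed_key = make_key(original_url)
        if computed_key != stored_key:
            yield original_url, stored_key, computed_key
```

Feeding it the lines of a CDX file and the key function to be checked yields one tuple per URL on which the two implementations disagree.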

We were really surprised to find a lot of gotchas. Some of them are listed below:

SURT + Canonicalization (truncated ///)
Original URL http://api.buzzurl.jp/api/counter/http:///www.e-research.biz/statistics/sta_13/002740.html
Expected jp,buzzurl,api)/api/counter/http://www.e-research.biz/statistics/sta_13/002740.html
Wayback Machine jp,buzzurl,api)/api/counter/http:/www.e-research.biz/statistics/sta_13/002740.html
COBALT jp,buzzurl,api)/api/counter/http://www.e-research.biz/statistics/sta_13/002740.html

SURT + Canonicalization (magic &)
Original URL http://www.viciouscycleworks.us/khxc/index.php?app=ccp0&ns=catshow&ref=kawasaki&prodsort=PRICEUP&sid=ze2509b1v54ldiu17199jlcf8n88cquz
Expected us,viciouscycleworks)/khxc/index.php?app=ccp0&ns=catshow&prodsort=priceup&ref=kawasaki
Wayback Machine us,viciouscycleworks)/khxc/index.php?&app=ccp0&ns=catshow&prodsort=priceup&ref=kawasaki
COBALT us,viciouscycleworks)/khxc/index.php?app=ccp0&ns=catshow&prodsort=priceup&ref=kawasaki

SURT + Canonicalization (unsorted arguments: “href” before “url”)
Original URL http://www.highschoolmusicalgames.us/frame/frame.php?url=www.simpy.com/simpy/LinkAdd.do?href=www.highschoolmusicalgames.us/play-flash-7593.html-Naruto%20Games%20Sakura%20dress
Expected us,highschoolmusicalgames)/frame/frame.php?href=www.highschoolmusicalgames.us/play-flash-7593.html-naruto%20games%20sakura%20dress?url=www.simpy.com/simpy/linkadd.do
Wayback Machine us,highschoolmusicalgames)/frame/frame.php?url=www.simpy.com/simpy/linkadd.do?href=www.highschoolmusicalgames.us/play-flash-7593.html-naruto%20games%20sakura%20dress
COBALT us,highschoolmusicalgames)/frame/frame.php?href=www.highschoolmusicalgames.us/play-flash-7593.html-naruto%20games%20sakura%20dress?url=www.simpy.com/simpy/linkadd.do

SURT + Canonicalization (missing /)
Original URL http://www.polish-online.com/wordpress/?p=187
Expected com,polish-online)/wordpress/?p=187
Wayback Machine com,polish-online)/wordpress?p=187
COBALT com,polish-online)/wordpress/?p=187

We found thousands of inconsistencies like these.
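
To illustrate the query-string part of the “magic &” example above, here is a minimal Python sketch that reproduces the expected key for that one case. It is not the COBALT or Wayback Machine code. The rule set (lowercase, drop empty segments, drop session parameters, sort) is inferred from the expected output shown above, and `SESSION_PARAMS` is a hypothetical list containing only the `sid` parameter seen in that example.

```python
# Assumption: only "sid" is treated as a session parameter, based on the example.
SESSION_PARAMS = {"sid"}


def canonicalize_query(query: str) -> str:
    """Lowercase, drop empty and session arguments, then sort the rest."""
    kept = []
    for arg in query.lower().split("&"):
        if not arg:
            continue  # an empty segment would otherwise leave a stray "&"
        name = arg.split("=", 1)[0]
        if name in SESSION_PARAMS:
            continue
        kept.append(arg)
    return "&".join(sorted(kept))


print(canonicalize_query(
    "app=ccp0&ns=catshow&ref=kawasaki&prodsort=PRICEUP"
    "&sid=ze2509b1v54ldiu17199jlcf8n88cquz"
))
```

Dropping empty segments before sorting matters: an empty string sorts ahead of every argument, so keeping one would put a separator right after the "?".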

Consequences: the SURT algorithm is central to the web archiving replay infrastructure.
The Wayback Machine uses this algorithm to index billions of URLs in “index files” (CDX), which can rapidly grow to millions of lines (terabytes in size).
When a bug like the ones outlined in this post appears, fixing the code is not enough: thousands of index files must also be regenerated, which is a costly operation.

That leaves the users of this API with a real dilemma between two options:
1. fix the Java API and regenerate the CDX files (bearing the cost and time of the operation)
2. stick with the buggy API and its consequences: missing resources, poor detection of duplicate URLs, etc.

Lesson learned: test your code and algorithms carefully, especially when other high-impact programs will be built on top of them.

by Younès HAFRI

Archiving Twitter, not so easy!

A few days ago, someone advertised how successfully they had archived a Twitter account here.

Update, April 10th, 2013: the archived Twitter account above seems to have been deleted. Why? Who knows…

At first glance, everything seems OK.

Hmmm … let’s take a closer look (click the image):

First, the date in the URL address bar (20100520091305, read 2010 05 20…, a timestamp used internally by the archive server that should point to a snapshot of the tweets taken around May 20th, 2010) doesn’t match the date of the tweet (Fri Oct 12 2012 10:01).
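
The mismatch is easy to check mechanically. The short Python sketch below assumes the 14-digit timestamp follows the YYYYMMDDhhmmss layout suggested by the reading above. The function name `parse_archive_timestamp` is ours, and the tweet date is the one quoted above.

```python
from datetime import datetime


def parse_archive_timestamp(ts: str) -> datetime:
    """Parse a 14-digit archive timestamp (assumed layout: YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


snapshot = parse_archive_timestamp("20100520091305")
tweet = datetime(2012, 10, 12, 10, 1)  # Fri Oct 12 2012 10:01

# A snapshot cannot contain a tweet posted after the snapshot was taken.
print(snapshot, tweet, tweet > snapshot)
```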

Second, whichever archive date you select, all the archived tweets are returned.

Finally, the last tweet is supposed to be the one from Sep 13 2012 (click the first image), but the result set includes tweets posted after that date (e.g. Fri Oct 12 2012 10:01).

An inexperienced user can easily miss these.

Lesson learned: don’t use (or abuse) the same old tools to archive new web media, especially social media.


by Younès HAFRI