4. Programming Languages

As a RESTful web service, UXTR is accessible from any programming language that can issue HTTP requests.
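Every client in this section sends the same kind of GET request: a fixed s=uxtr parameter plus the output producer type (o), the unique identifier or tag (u) and the target URL (t). The following is a minimal Python sketch of how such a request URL can be assembled. The function name build_uxtr_url is hypothetical, and the sketch assumes that the server percent-decodes query parameters, which the listings below do not confirm since they pass the target URL unencoded.

```python
from urllib.parse import urlencode


def build_uxtr_url(host, port, output, uid, target):
    """Return a UXTR request URL (hypothetical helper, not part of UXTR)."""
    # urlencode percent-encodes each value, so a target URL that contains
    # '&' or '?' cannot be confused with the UXTR parameters themselves.
    # Assumption: the server decodes percent-encoded query values.
    query = urlencode({"s": "uxtr", "o": output, "u": uid, "t": target})
    return "http://%s:%s/?%s" % (host, port, query)


if __name__ == "__main__":
    print(build_uxtr_url("127.0.0.1", 6789, "json", "321", "http://www.nasa.gov/"))
```

With the values above, the sketch prints http://127.0.0.1:6789/?s=uxtr&o=json&u=321&t=http%3A%2F%2Fwww.nasa.gov%2F.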

4.1. PHP

PHP is a general-purpose programming language. It provides a stable binding to libcurl,
a library for communicating with servers over many different protocols (HTTP, FTP, etc.).
The following is a sketch of a PHP program.
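#!/usr/bin/env php
<?php

// Write callback: libcurl calls it for each chunk of the response body.
// It must return the number of bytes it handled, otherwise the transfer aborts.
function read_body($ch, $string)
{
   $length = strlen($string);
   // Print the chunk as received; appending a newline would split the output mid-chunk.
   echo $string;
   return $length;
}

...

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'http://' . $server . ':' . $port . '/?s=uxtr&o=' . $output . '&u=' . $uid . '&t=' . $target);
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'read_body');

curl_exec($ch);

if ($error = curl_error($ch)) {
  echo "Error: $error\n";
  exit(1);
}

curl_close($ch);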
A fully working version (uxtr.php) is available at:
This program depends on Pharse, a command-line option-parsing class for PHP.
We assume pharse.php is located in the current directory.
$ sudo apt-get install php5-cli php5-curl
$ php ./uxtr.php --help
Usage: uxtr.php [options]
Options:
  --uid, -u <s>       Unique identifier or tag
  --output, -o <s>    Output producer type
  --target, -t <s>    Target URL
  --server, -s <s>    UXTR host name
  --port, -p <s>      UXTR port name
  --version, -v <i>   show program's version number and exit
  --help, -h          Display this help banner
Some usage examples:
$ php ./uxtr.php -o json  -u 321 -t "http://www.nasa.gov/"
$ php ./uxtr.php -o plain -u zyx -t "http://www.cern.ch/"

4.2. Python

The Python standard library provides classes that implement the client side of
the HTTP and HTTPS protocols.
The following is a sketch of a Python program that uses the lower-level asynchronous
socket handler API (asyncore).
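#!/usr/bin/env python3
# Note: asyncore is deprecated since Python 3.6 and was removed in Python 3.12.

import asyncore
import socket
import sys

class HTTPClient(asyncore.dispatcher):

    def __init__(self, host, port, path):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, port))
        # Request bytes that still have to be sent.
        self.buffer = ('GET %s HTTP/1.0\r\n\r\n' % path).encode('ascii')

    def handle_connect(self):
        pass

    def handle_close(self):
        self.close()

    def writable(self):
        # Ask for write events only while part of the request is unsent.
        return len(self.buffer) > 0

    def handle_write(self):
        # Without this handler the request would never leave the client.
        sent = self.send(self.buffer)
        self.buffer = self.buffer[sent:]

    def handle_read(self):
        # Prints the raw response, status line and headers included.
        sys.stdout.buffer.write(self.recv(1024))


if __name__ == "__main__":
    host   = '127.0.0.1'
    port   = 6789

    output = 'json'
    uid    = '012345'
    target = 'http://perma.cc/'

    path   = '/?s=uxtr&o=%s&u=%s&t=%s' % (output, uid, target)

    client = HTTPClient(host, port, path)
    asyncore.loop()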
A fully working version (uxtr.py) is available at:
$ python ./uxtr.py --help
Usage: uxtr.py [options]
Options:
  --version                    show program's version number and exit
  -h, --help                   show this help message and exit
  -u UID, --uid=UID            Unique identifier or tag
  -o OUTPUT, --output=OUTPUT   Output producer type
  -t TARGET, --target=TARGET   Target URL
  -s HOST, --server=HOST       UXTR host name
  -p PORT, --port=PORT         UXTR port number
Some usage examples:
$ python ./uxtr.py -o json  -u 321 -t "http://www.nasa.gov/"
$ python ./uxtr.py -o plain -u zyx -t "http://www.cern.ch/"
$ python ./uxtr.py -o raw   -u foo -t "http://techcrunch.com/"
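Since asyncore is no longer part of recent Python releases (it was removed in Python 3.12), the same request can also be sketched with the standard urllib.request module. This is only an illustration, not the code of uxtr.py: the function name stream_response and its chunk_size and timeout parameters are assumptions made for the example.

```python
from urllib.request import urlopen


def stream_response(url, chunk_size=1024, timeout=10):
    """Yield the response body in chunks of at most chunk_size bytes.

    Hypothetical helper: unlike the raw-socket client, urlopen parses the
    status line and headers, so only the body is yielded.
    """
    with urlopen(url, timeout=timeout) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read marks the end of the body
                break
            yield chunk
```

A caller would iterate over stream_response("http://127.0.0.1:6789/?s=uxtr&o=json&u=012345&t=http://perma.cc/") and write each chunk to sys.stdout.buffer, using the host, port and parameters of the asyncore sketch.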

4.3. Ruby

The same approach works in Ruby. The following code snippet shows how to connect to UXTR and retrieve all extracted links.

...
class UXTRTCPClient < EM::RubySockets::TcpClient

  def on_connected
  end

  def on_disconnected
    EM.stop_event_loop
  end

  def on_connection_error error
    puts "%s" % [error.inspect]
    EM.stop_event_loop
  end

  def unbind
    disconnect
    EM.stop_event_loop
  end

  def receive_data data
    if data && data.length >0
      puts data
    else
      disconnect
      EM.stop_event_loop
    end
  end
end

def main(argv)
  options = OptParser.parse(argv)
  path = "/?s=uxtr&o=%s&u=%s&t=%s" % [options.out, options.uid, options.target]
  EM.run do
     conn = EM::RubySockets.tcp_connect options.host, options.port, UXTRTCPClient
     conn.send_data "GET %s HTTP/1.0\r\n\r\n" % [path]
  end
end

main(ARGV)
A fully working version (uxtr.rb) is available at:
$ gem install em-ruby-sockets
$ ruby ./uxtr.rb --help
usage: uxtr.rb [options]
  -u, --uid UID         Unique identifier or tag
  -t, --target TARGET   Target URL
  -s, --server HOST     UXTR host name
  -p, --port PORT       UXTR port number
  -o, --output OUTPUT   Output producer type
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
Some usage examples:
$ ruby ./uxtr.rb -o json  -u 321 -t "http://www.nasa.gov/"
$ ruby ./uxtr.rb -o plain -u zyx -t "http://www.cern.ch/"
$ ruby ./uxtr.rb -o raw   -u foo -t "http://techcrunch.com/"

4.4. Perl

Accessing UXTR from Perl is straightforward. Dozens of HTTP modules exist to perform the task.
The following code snippet shows how to connect to UXTR and retrieve all extracted links.
...
sub main
{
   my $BUFSZ = 1024;
   my $s = Net::HTTP::NB->new(Host => "$host:$port") || die $@;
   $s->write_request(GET => "/?s=uxtr&o=$output&u=$uid&t=$target");

   my $sel = IO::Select->new($s);

   # Wait until the complete response header has arrived.
   READ_HEADER: {
      die "Header timeout" unless $sel->can_read(10);
      my($code, $mess, %h) = $s->read_response_headers;
      redo READ_HEADER unless $code;
   }

   # Print the body chunk by chunk until the end of the response.
   while (1) {
      die "Body timeout" unless $sel->can_read(10);
      my $buf;
      my $n = $s->read_entity_body($buf, $BUFSZ);
      last unless $n;
      print $buf;
   }
}
...
main(@ARGV);
A fully working version (uxtr.pl) is available at:
$ sudo perl -MCPAN -eshell
cpan[1]> install Net::HTTP::NB
cpan[1]> install Getopt::Long

$ perl ./uxtr.prl --help
Usage: uxtr.prl [options]
Options:
  --version                    show program's version number and exit
  -h, --help                   show this help message and exit
  -u UID, --uid=UID            Unique identifier or tag
  -o OUTPUT, --output=OUTPUT   Output producer type
  -t TARGET, --target=TARGET   Target URL
  -s HOST, --server=HOST       UXTR host name
  -p PORT, --port=PORT         UXTR port number
Some usage examples:
$ perl ./uxtr.prl -o json  -u 321 -t "http://www.nasa.gov/"
$ perl ./uxtr.prl -o plain -u zyx -t "http://www.cern.ch/"

4.5. Java

This tutorial wouldn’t be complete without a Java example.
The following code snippet shows how to connect to UXTR and retrieve all extracted links.
...
OkHttpClient client = new OkHttpClient();

 void run() throws IOException {
     // The "/" path must precede the query string.
     get(new URL("http://" + host + ":" + port + "/?s=uxtr&o=" + output + "&u=" + uid + "&t=" + target));
 }

 void get(URL url) throws IOException {
     HttpURLConnection connection = client.open(url);
     InputStream in = null;
     try {
         // Read the response.
         in = connection.getInputStream();
         readChunks(in);
     } finally {
         if (in != null) in.close();
     }
 }

 void readChunks(InputStream in) throws IOException {
     byte[] buffer = new byte[BUF_SZ];
     for (int count; (count = in.read(buffer)) != -1;) {
         // Decode only the bytes read in this pass, not the whole buffer.
         System.out.print(new String(buffer, 0, count, "UTF-8"));
     }
 }

 public static void main(String[] args) throws IOException {
     params(args);
     new UXTR().run();
 }
A fully working version (UXTR.java) is available at:
This program depends on the excellent OkHttp and commons-cli packages.
We assume both jars are present in the current directory.
$ export JARS=$PWD/okhttp-1.2.1-jar-with-dependencies.jar:$PWD/commons-cli-1.2.jar

$ javac -cp .:$JARS UXTR.java
$ java  -cp .:$JARS UXTR --help
usage: UXTR [options]
 -o, --output <arg>   Output producer type
 -p, --port <arg>     UXTR port name
 -s, --server <arg>   UXTR host name
 -t, --target <arg>   Target URL
 -u, --uid <arg>      Unique identifier or tag
 -v, --version        Show program's version number and exit
Some usage examples:
$ java -cp .:$JARS UXTR -o json  -u 321 -t "http://www.nasa.gov/"
$ java -cp .:$JARS UXTR -o plain -u zyx -t "http://www.cern.ch/"

4.6. C#

C#, through the .NET class library, provides a complete networking stack.
We can reach the UXTR service through different connectors like WebClient, WebRequest, RestSharp, etc.
The following code snippet shows how to connect to UXTR and retrieve all extracted links using the WebRequest class.
...
static public void Main(string[] args)
     {
         try
         {
             var url = "http://" + host + ":" + port + "/?s=uxtr&o=" + output + "&u=" + uid + "&t=" + target;
             var req = WebRequest.Create(url);

             using (var response = req.GetResponse())
             using (var responseStream = response.GetResponseStream())
             {
                 var buffer = new byte[BUF_SZ];
                 int bytesRead;
                 // Read returns 0 once the end of the stream is reached.
                 while ((bytesRead = responseStream.Read(buffer, 0, BUF_SZ)) > 0)
                 {
                     Console.Write(System.Text.Encoding.UTF8.GetString(buffer, 0, bytesRead));
                 }
             }
         }
         ...
A fully working version (uxtr.cs) is available at:
This program depends on NDesk.Options. Use the NuGet package manager with Visual Studio or Mono to install it.
[Screenshot: NuGet with Microsoft Visual Studio]

Then, simply build the project and run it.
c:\> uxtr.exe --help
[Screenshot: UXTR CSharp/C# interface]

Some usage examples:
c:\> uxtr.exe -o json  -u 321 -t "http://www.nasa.gov/"
c:\> uxtr.exe -o plain -u zyx -t "http://www.cern.ch/"

4.7. Go Lang

The Go programming language provides a native HTTP client implementation.
The following code snippet shows how to connect to UXTR and retrieve all extracted links.
...
func main() {
     flag.Parse()

     if len(os.Args) == 1 {
             fmt.Println("For usage: uxtr.go --help")
             os.Exit(1)
     }

     lnk := "http://" + *host + ":" + *port + "/?s=uxtr&o=" + *output + "&u=" + *uid + "&t=" + *target

     // Validate the URL; "u" avoids shadowing the url package.
     u, err := url.Parse(lnk)
     checkError(err)

     client := &http.Client{}

     request, err := http.NewRequest("GET", u.String(), nil)
     checkError(err)

     // Check the error first: response is nil when the request fails.
     response, err := client.Do(request)
     checkError(err)

     if response.StatusCode != http.StatusOK {
             fmt.Println(response.Status)
             response.Body.Close()
             os.Exit(2)
     }

     var buf [512]byte
     reader := response.Body
     for {
             n, err := reader.Read(buf[0:])
             // Print before checking err: the last read may return data together with io.EOF.
             fmt.Print(string(buf[0:n]))
             if err != nil {
                     break
             }
     }
     response.Body.Close()
}
A fully working version (uxtr.go) is available at:
$ go run ./uxtr.go --help
Usage of uxtr.go:
  -o="plain": Output producer type
  -output="plain": Output producer type
  -p="6789": UXTR port name
  -port="6789": UXTR port name
  -s="127.0.0.1": UXTR host name
  -server="127.0.0.1": UXTR host name
  -t="http://perma.cc/": Target URL
  -target="http://perma.cc/": Target URL
  -u="a123z": Unique identifier or tag
  -uid="a123z": Unique identifier or tag
  -v=false: show program's version number and exit
  -version=false: show program's version number and exit
Some usage examples:
$ go run ./uxtr.go -o json  -u 321 -t "http://www.nasa.gov/"
$ go run ./uxtr.go -o plain -u zyx -t "http://www.cern.ch/"

4.8. Lua

The Lua programming language is widely used as a scripting language by game programmers.
It’s small, easy to embed, and fast with a short learning curve.
The following code snippet shows how to connect to UXTR and retrieve all extracted links.
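local http = require "socket.http"
local ltn12 = require "ltn12"
...
-- With a table argument, request() returns 1 on success, followed by the
-- status code, the response headers and the status line.
local ok, code, headers, status = http.request{
 url = "http://" .. server .. ":" .. port .. "/?s=uxtr&o=" .. output .. "&u=" .. uid .. "&t=" .. target,
 sink = ltn12.sink.file(io.stdout),
}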

A fully working version (uxtr.lua) is available at:
The program depends on luasocket and cliargs. Install them with:
$ sudo apt-get install lua50 luarocks
$ sudo luarocks install luasocket

$ sudo luarocks install luasec OPENSSL_LIBDIR=/usr/lib/i386-linux-gnu/    # for 32-bit
$ sudo luarocks install luasec OPENSSL_LIBDIR=/usr/lib/x86_64-linux-gnu/  # for 64-bit

$ sudo luarocks install https://raw.github.com/amireh/lua_cliargs/master/lua_cliargs-2.1-2.rockspec
Show supported options:
$ lua ./uxtr.lua --help
UXTR with Lua
Some usage examples:
$ lua ./uxtr.lua -o json  -u 321 -t "http://www.nasa.gov/"
$ lua ./uxtr.lua -o plain -u zyx -t "http://www.cern.ch/"

4.9. Erlang

Erlang is UXTR's native programming language. Both a REST and a native API are available.
The example below uses the native API by calling uxtr:xtract/1.
# first, attach to the running node (in production):
$ cd $HOME/.uxtr
$ ./bin/uxtr attach
% then type:
(uxtr_attach@aleph)1> uxtr:xtract("http://www.nasa.gov/").

(uxtr_attach@aleph)1> flush().
Shell got {recv_start,7377360}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/foresee/foresee-trigger.js">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/sites/all/themes/custom/NASAOmegaHTML5/js/redirection-mobile.js">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/modules/system/system.base.css?mtwble">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/modules/system/system.menus.css?mtwble">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/modules/system/system.messages.css?mtwble">>}
...
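If the shell output is captured to a file, the extracted links can be pulled out of the recv_url messages with a few lines of Python. This is only a sketch based on the message shape shown above; the meaning of the two numeric fields is not documented here, so the pattern merely skips them, and the function name extract_urls is hypothetical.

```python
import re

# Matches messages such as:
#   Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/">>}
# The two numeric fields are matched but not interpreted.
RECV_URL = re.compile(r'\{recv_url,\d+,\d+,<<"([^"]*)">>\}')


def extract_urls(shell_output):
    """Return the URLs carried by recv_url messages, in order of appearance."""
    return RECV_URL.findall(shell_output)


if __name__ == "__main__":
    sample = (
        'Shell got {recv_start,7377360}\n'
        'Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/">>}\n'
        'Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/foresee/foresee-trigger.js">>}\n'
    )
    for url in extract_urls(sample):
        print(url)
```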

4.10. Node.js

The purpose of the next Node.js program is to follow links starting from a given page
and draw a graph of the links between all the visited pages down to a given depth level.
We store the collected information about each page in a graphNode object that will be
used later to format our final dot script output.
The crawl function is the most interesting part of the program: for each level, we make an
HTTP GET request to the UXTR host server with the target page URL as a parameter.
...
function crawl(level, urls, nextLevelUrls, cb, finalize) {
     if (urls.length==0) {
         if (level==argv.n-1 || nextLevelUrls.length==0) {finalize(); return;}
         console.log('Hopping to level %d',level);
         crawl(level+1, nextLevelUrls, [], cb, finalize);
     } else {
         var u = urls.shift();
         console.log("Analyzing: %s", u.path);
         http.get("http://"+argv.s+":"+argv.p+"/?s=uxtr&o=json&u=1&t="+host+u.path,
             function(res) {
                 var buff=[];
                 res.on("data", function(data) {
                     buff.push(data);
                 });
                 res.on("end", function() {
                     var links=filterLinks(JSON.parse(Buffer.concat(buff).toString()));
                     links.forEach(function(e) {
                         var newNode = cb(e.url, e.kind, u);
                         newNode && e.kind==1 && nextLevelUrls.push(newNode);
                     });
                     crawl(level, urls, nextLevelUrls, cb, finalize);
                 });
             }
         );
     }
 }
 ...
The server in turn responds with a formatted list of all links. We filter that list to make sure
we don't harvest a page more than once.
The last piece of code generates a DOT graph file by visiting all graphNodes.
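The overall idea (a level-by-level crawl that never visits a page twice, followed by DOT generation) can be sketched in Python as follows. This is an illustration under simplifying assumptions, not a port of uxtr.js: extract_links is a hypothetical callable that stands in for the HTTP request to UXTR, every link is treated as followable (the Node.js code only follows links whose kind is 1), and the names crawl and to_dot are made up for the example.

```python
def crawl(start, extract_links, max_depth):
    """Breadth-first crawl from start; return (nodes, edges).

    extract_links(path) is a hypothetical stand-in for the UXTR request:
    it returns the list of paths linked from the given page.
    """
    nodes = {start: 0}   # path -> numeric node id
    edges = []           # (source id, destination id)
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for path in frontier:
            for link in extract_links(path):
                if link not in nodes:
                    # First time this page is seen: create its node and
                    # queue it for the next level.
                    nodes[link] = len(nodes)
                    next_frontier.append(link)
                edges.append((nodes[path], nodes[link]))
        if not next_frontier:
            break
        frontier = next_frontier
    return nodes, edges


def to_dot(nodes, edges):
    """Format the graph as a DOT script similar to the output shown below."""
    lines = ["digraph LinkGraph {"]
    for path, node_id in nodes.items():
        lines.append('%d [label="%s" shape=square]' % (node_id, path))
    for src, dst in edges:
        lines.append("%d -> %d" % (src, dst))
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Tiny in-memory site used instead of a live UXTR server.
    site = {"/": ["/about", "/login"], "/about": ["/"], "/login": []}
    nodes, edges = crawl("/", lambda p: site.get(p, []), max_depth=2)
    print(to_dot(nodes, edges))
```

Keeping the set of known pages in the same dictionary that assigns node ids means the duplicate check and the node numbering cannot drift apart.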
The complete source code (uxtr.js) is available at:
Basic usage:
## install the "optimist" package
$ npm install optimist

## then, run it ("-n 2" means depth level 2):
$ node uxtr.js -s 127.0.0.1 -p 6789 -n 2 -t http://perma.cc/

Analyzing: /
Creating new node and linking
Creating new node and linking
Creating new node and linking
...
Hopping to level 0
Analyzing: /about
Creating new node and linking
Analyzing: /login
...
Creating new node and linking
Analyzing: /register
Analyzing: /privacy-policy
Analyzing: /copyright-policy
Generating dot script...

digraph LinkGraph {
0 [label="/" shape=square]
0 -> 1
0 -> 2
0 -> 3
0 -> 4
0 -> 5
...
1 [label="/static/css/bootstrap3.css" shape=square]
2 [label="/static/css/style-responsive.css" shape=square]
3 [label="/static/css/carousel.css" shape=square]
4 [label="/static/js/modernizr.js" shape=square]
...
67 [label="/static/css/bootstrap.css" shape=square]
68 [label="/static/css/bootstrap-responsive.css" shape=square]
69 [label="/static/css/style.css" shape=square]
70 [label="/static/js/bootstrap.js" shape=square]
}
For the sake of clarity, the generated DOT Graph file is available at:
Then, we use the Graphviz dot command to generate the final PNG image.
$ wget http://webarchivingbucket.com/uxtr/ex/perma.dot
$ dot -Tpng perma.dot > perma.png
And the final graph image (~7 MB PNG) is:
[Image: Perma.cc, Graphviz DOT, Node.js]