4. Programming Languages
As a RESTful web service, UXTR is accessible from any programming language.
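Whatever the language, every client in this section issues the same HTTP GET request with the query parameters s (service), o (output), u (uid), and t (target). As a minimal sketch, the request URL can be composed generically; the helper name build_uxtr_url is hypothetical, while the host, port, and parameter names follow the examples below:

```python
# Minimal sketch: composing a UXTR request URL. The helper name is
# hypothetical; the query parameters (s, o, u, t) follow the examples below.
from urllib.parse import urlencode, quote

def build_uxtr_url(server, port, output, uid, target):
    # safe="/:" keeps the target URL readable instead of percent-encoding it
    query = urlencode({"s": "uxtr", "o": output, "u": uid, "t": target},
                      safe="/:", quote_via=quote)
    return "http://%s:%s/?%s" % (server, port, query)

# The resulting URL can then be fetched with any HTTP client, e.g.
# urllib.request.urlopen(build_uxtr_url("127.0.0.1", 6789, "json", "321",
#                                       "http://www.nasa.gov/"))
```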
4.1. PHP
PHP is a general-purpose programming language. It provides a stable binding to libcurl,
a library for communicating over many different protocols (HTTP, FTP, etc.).
The following is a sketch of a PHP program.
#!/usr/bin/env php
<?php
function read_body($ch, $string)
{
    $length = strlen($string);
    echo "$string\n";
    return $length;
}
...
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://' . $server . ':' . $port . '/?s=uxtr&o=' . $output . '&u=' . $uid . '&t=' . $target);
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'read_body');
curl_exec($ch);
if ($error = curl_error($ch)) {
    echo "Error: $error\n";
    exit(1);
}
curl_close($ch);
A fully working version (uxtr.php) is available at:
This program depends on
Pharse, a
command-line option-parsing class for PHP.
We assume pharse.php is located in the current directory.
$ sudo apt-get install php5-cli php5-curl
$ php ./uxtr.php --help
Usage: uxtr.php [options]
- Options:
  --uid, -u <s>        Unique identifier or tag
  --output, -o <s>     Output producer type
  --target, -t <s>     Target URL
  --server, -s <s>     UXTR host name
  --port, -p <s>       UXTR port number
  --version, -v <i>    Show program's version number and exit
  --help, -h           Display this help banner
$ php ./uxtr.php -o json -u 321 -t "http://www.nasa.gov/"
$ php ./uxtr.php -o plain -u zyx -t "http://www.cern.ch/"
4.2. Python
The Python standard library defines classes that implement the client side of
the HTTP and HTTPS protocols.
The following is a sketch of a Python program that uses the asynchronous
socket handler API (asyncore).
#!/usr/bin/env python
import asyncore, socket

class HTTPClient(asyncore.dispatcher):
    def __init__(self, host, port, path):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, port))
        self.buffer = 'GET %s HTTP/1.0\r\n\r\n' % path

    def handle_connect(self):
        pass

    def handle_close(self):
        self.close()

    def handle_read(self):
        print self.recv(1024)

    def writable(self):
        # keep asking to write until the request has been fully sent
        return len(self.buffer) > 0

    def handle_write(self):
        sent = self.send(self.buffer)
        self.buffer = self.buffer[sent:]

if __name__ == "__main__":
    host = '127.0.0.1'
    port = 6789
    output = 'json'
    uid = '012345'
    target = 'http://perma.cc/'
    path = '/?s=uxtr&o=%s&u=%s&t=%s' % (output, uid, target)
    client = HTTPClient(host, port, path)
    asyncore.loop()
A fully working version (uxtr.py) is available at:
$ python ./uxtr.py --help
- Options:
  --version                     show program's version number and exit
  -h, --help                    show this help message and exit
  -u UID, --uid=UID             Unique identifier or tag
  -o OUTPUT, --output=OUTPUT    Output producer type
  -t TARGET, --target=TARGET    Target URL
  -s HOST, --server=HOST        UXTR host name
  -p PORT, --port=PORT          UXTR port number
$ python ./uxtr.py -o json -u 321 -t "http://www.nasa.gov/"
$ python ./uxtr.py -o plain -u zyx -t "http://www.cern.ch/"
$ python ./uxtr.py -o raw -u foo -t "http://techcrunch.com/"
4.3. Ruby
The same applies to Ruby. The following code snippet shows
how to connect to UXTR and retrieve all extracted links.
...
class UXTRTCPClient < EM::RubySockets::TcpClient
  def on_connected
  end

  def on_disconnected
    EM.stop_event_loop
  end

  def on_connection_error error
    puts "%s" % [error.inspect]
    EM.stop_event_loop
  end

  def unbind
    disconnect
    EM.stop_event_loop
  end

  def receive_data data
    if data && data.length > 0
      puts data
    else
      disconnect
      EM.stop_event_loop
    end
  end
end

def main(argv)
  options = OptParser.parse(argv)
  path = "/?s=uxtr&o=%s&u=%s&t=%s" % [options.out, options.uid, options.target]
  EM.run do
    conn = EM::RubySockets.tcp_connect options.host, options.port, UXTRTCPClient
    conn.send_data "GET %s HTTP/1.0\r\n\r\n" % [path]
  end
end

main(ARGV)
A fully working version (uxtr.rb) is available at:
$ gem install em-ruby-sockets
$ ruby ./uxtr.rb --help
  -u, --uid UID           Unique identifier or tag
  -t, --target TARGET     Target URL
  -s, --server HOST       UXTR host name
  -p, --port PORT         UXTR port number
  -o, --output OUTPUT     Output producer type
  -h, --help              show this help message and exit
  -v, --version           show program's version number and exit
$ ruby ./uxtr.rb -o json -u 321 -t "http://www.nasa.gov/"
$ ruby ./uxtr.rb -o plain -u zyx -t "http://www.cern.ch/"
$ ruby ./uxtr.rb -o raw -u foo -t "http://techcrunch.com/"
4.4. Perl
Accessing UXTR from Perl is straightforward: dozens of HTTP modules exist to perform the task.
The following code snippet shows how to connect to UXTR and retrieve all extracted links.
...
sub main
{
    my $BUFSZ = 1024;
    my $s = Net::HTTP::NB->new(Host => "$host:$port") || die $@;
    $s->write_request(GET => "/?s=uxtr&o=$output&u=$uid&t=$target");
    my $sel = IO::Select->new($s);
    READ_HEADER: {
        die "Header timeout" unless $sel->can_read(10);
        my ($code, $mess, %h) = $s->read_response_headers;
        redo READ_HEADER unless $code;
    }
    while (1) {
        die "Body timeout" unless $sel->can_read(10);
        my $buf;
        my $n = $s->read_entity_body($buf, $BUFSZ);
        last unless $n;
        print $buf;
    }
}
...
main(@ARGV);
A fully working version (uxtr.pl) is available at:
$ sudo perl -MCPAN -eshell
cpan[1]> install Net::HTTP::NB
cpan[2]> install Getopt::Long
$ perl ./uxtr.pl --help
Usage: uxtr.pl [options]
- Options:
  --version                     show program's version number and exit
  -h, --help                    show this help message and exit
  -u UID, --uid=UID             Unique identifier or tag
  -o OUTPUT, --output=OUTPUT    Output producer type
  -t TARGET, --target=TARGET    Target URL
  -s HOST, --server=HOST        UXTR host name
  -p PORT, --port=PORT          UXTR port number
$ perl ./uxtr.pl -o json -u 321 -t "http://www.nasa.gov/"
$ perl ./uxtr.pl -o plain -u zyx -t "http://www.cern.ch/"
4.5. Java
This tutorial wouldn’t be complete without a Java example.
The following code snippet shows how to connect to UXTR and retrieve all extracted links.
...
OkHttpClient client = new OkHttpClient();

void run() throws IOException {
    get(new URL("http://" + host + ":" + port + "/?s=uxtr&o=" + output + "&u=" + uid + "&t=" + target));
}

void get(URL url) throws IOException {
    HttpURLConnection connection = client.open(url);
    InputStream in = null;
    try {
        // Read the response.
        in = connection.getInputStream();
        readChunks(in);
    } finally {
        if (in != null) in.close();
    }
}

void readChunks(InputStream in) throws IOException {
    byte[] buffer = new byte[BUF_SZ];
    for (int count; (count = in.read(buffer)) != -1;) {
        System.out.print(new String(buffer, 0, count, "UTF-8"));
    }
}

public static void main(String[] args) throws IOException {
    params(args);
    new UXTR().run();
}
A fully working version (UXTR.java) is available at:
We assume both jars are present in the current directory.
$ export JARS=$PWD/okhttp-1.2.1-jar-with-dependencies.jar:$PWD/commons-cli-1.2.jar
$ javac -cp .:$JARS UXTR.java
$ java -cp .:$JARS UXTR --help
  -o, --output <arg>    Output producer type
  -p, --port <arg>      UXTR port number
  -s, --server <arg>    UXTR host name
  -t, --target <arg>    Target URL
  -u, --uid <arg>       Unique identifier or tag
  -v, --version         Show program's version number and exit
$ java -cp .:$JARS UXTR -o json -u 321 -t "http://www.nasa.gov/"
$ java -cp .:$JARS UXTR -o plain -u zyx -t "http://www.cern.ch/"
4.6. C#
Microsoft's C#, via the .NET framework, provides a complete HTTP stack.
The following code snippet shows how to connect to UXTR and retrieve all extracted links using the WebRequest class.
...
static public void Main(string[] args)
{
    try
    {
        var url = "http://" + host + ":" + port + "/?s=uxtr&o=" + output + "&u=" + uid + "&t=" + target;
        var req = WebRequest.Create(url);
        using (var response = req.GetResponse())
        {
            using (var responseStream = response.GetResponseStream())
            {
                var buffer = new byte[BUF_SZ];
                int bytesRead;
                do
                {
                    bytesRead = responseStream.Read(buffer, 0, BUF_SZ);
                    Console.Write(System.Text.Encoding.UTF8.GetString(buffer, 0, bytesRead));
                } while (bytesRead > 0);
            }
        }
    }
    catch (WebException e)
    {
        Console.Error.WriteLine("Error: " + e.Message);
    }
}
A fully working version (uxtr.cs) is available at:
Then, simply build the project and run it.
c:\> uxtr.exe -o json -u 321 -t "http://www.nasa.gov/"
c:\> uxtr.exe -o plain -u zyx -t "http://www.cern.ch/"
4.7. Go
The Go programming language provides a native HTTP client implementation.
The following code snippet shows how to connect to UXTR and retrieve all extracted links.
...
func main() {
    flag.Parse()
    if len(os.Args) == 1 {
        fmt.Println("For usage: uxtr.go --help")
        os.Exit(1)
    }
    lnk := "http://" + *host + ":" + *port + "/?s=uxtr&o=" + *output + "&u=" + *uid + "&t=" + *target
    url, err := url.Parse(lnk)
    checkError(err)
    client := &http.Client{}
    request, err := http.NewRequest("GET", url.String(), nil)
    checkError(err)
    response, err := client.Do(request)
    checkError(err)
    defer response.Body.Close()
    if response.StatusCode != http.StatusOK {
        fmt.Println(response.Status)
        os.Exit(2)
    }
    var buf [512]byte
    reader := response.Body
    for {
        n, err := reader.Read(buf[0:])
        if n > 0 {
            fmt.Print(string(buf[0:n]))
        }
        if err != nil {
            break
        }
    }
}
A fully working version (uxtr.go) is available at:
$ go run ./uxtr.go --help
  -o="plain"                    Output producer type
  -output="plain"               Output producer type
  -p="6789"                     UXTR port number
  -port="6789"                  UXTR port number
  -s="127.0.0.1"                UXTR host name
  -server="127.0.0.1"           UXTR host name
  -t="http://perma.cc/"         Target URL
  -target="http://perma.cc/"    Target URL
  -u="a123z"                    Unique identifier or tag
  -uid="a123z"                  Unique identifier or tag
  -v=false                      show program's version number and exit
  -version=false                show program's version number and exit
$ go run ./uxtr.go -o json -u 321 -t "http://www.nasa.gov/"
$ go run ./uxtr.go -o plain -u zyx -t "http://www.cern.ch/"
4.8. Lua
The Lua programming language is widely used as a scripting language by game programmers.
It's small, easy to embed, and fast, with a short learning curve.
The following code snippet shows how to connect to UXTR and retrieve all extracted links.
local socket = require "socket.http"
local ltn12 = require "ltn12"
...
client, r, c, h = socket.request{
    url = "http://" .. server .. ":" .. port .. "/?s=uxtr&o=" .. output .. "&u=" .. uid .. "&t=" .. target,
    sink = ltn12.sink.file(io.stdout),
}
A fully working version (uxtr.lua) is available at:
The program depends on luasocket and cliargs. Install them with:
$ sudo apt-get install lua50 luarocks
$ sudo luarocks install luasocket
$ sudo luarocks install luasec OPENSSL_LIBDIR=/usr/lib/i386-linux-gnu/     # for 32-bit
$ sudo luarocks install luasec OPENSSL_LIBDIR=/usr/lib/x86_64-linux-gnu/   # for 64-bit
$ sudo luarocks install https://raw.github.com/amireh/lua_cliargs/master/lua_cliargs-2.1-2.rockspec
$ lua ./uxtr.lua -o json -u 321 -t "http://www.nasa.gov/"
$ lua ./uxtr.lua -o plain -u zyx -t "http://www.cern.ch/"
4.9. Erlang
Erlang is the default programming language for
UXTR. Both a
REST and a
native API are available.
In the below example, we’ll use the
native API via call to
uxtr:xtract/1.
# first, attach to the running node (in production):
$ cd $HOME/.uxtr
$ ./bin/uxtr attach
% then type:
(uxtr_attach@aleph)1> uxtr:xtract("http://www.nasa.gov/").
(uxtr_attach@aleph)2> flush().
Shell got {recv_start,7377360}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/foresee/foresee-trigger.js">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/sites/all/themes/custom/NASAOmegaHTML5/js/redirection-mobile.js">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/modules/system/system.base.css?mtwble">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/modules/system/system.menus.css?mtwble">>}
Shell got {recv_url,7377360,0,<<"http://www.nasa.gov/modules/system/system.messages.css?mtwble">>}
...
4.10. Node.js
The purpose of the next Node.js program is to follow links starting from a given page,
drawing a graph of the links between all visited pages down to a given depth level.
We store the collected information about each page in a graphNode object that is later
used to format the final DOT script output.
The crawl function is the most interesting part of the program: for each level, we make an
HTTP GET request to the UXTR host server with the target page URL as a parameter.
...
function crawl(level, urls, nextLevelUrls, cb, finalize) {
    if (urls.length == 0) {
        if (level == argv.n - 1 || nextLevelUrls.length == 0) { finalize(); return; }
        console.log('Hopping to level %d', level);
        crawl(level + 1, nextLevelUrls, [], cb, finalize);
    } else {
        var u = urls.shift();
        console.log("Analyzing: %s", u.path);
        http.get("http://" + argv.s + ":" + argv.p + "/?s=uxtr&o=json&u=1&t=" + host + u.path,
            function(res) {
                var buff = [];
                res.on("data", function(data) {
                    buff.push(data);
                });
                res.on("end", function() {
                    var links = filterLinks(JSON.parse(buff.join('')));
                    links.forEach(function(e) {
                        var newNode = cb(e.url, e.kind, u);
                        newNode && e.kind == 1 && nextLevelUrls.push(newNode);
                    });
                    crawl(level, urls, nextLevelUrls, cb, finalize);
                });
            }
        );
    }
}
...
The server in turn responds with a formatted list of all links. We filter that list to make sure
we don't harvest a page more than once.
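The de-duplication step can be sketched as follows. This is a hypothetical helper (filter_links is not the program's filterLinks), assuming link records shaped like the JSON output, with url and kind fields:

```python
# Sketch: drop links whose URL has already been visited. The record shape
# ({"url": ..., "kind": ...}) is an assumption based on the JSON output above.
def filter_links(links, visited):
    fresh = []
    for link in links:
        if link["url"] not in visited:
            visited.add(link["url"])  # remember it for later levels
            fresh.append(link)
    return fresh
```

The visited set persists across depth levels, so a page reached from two different parents is still analyzed only once.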
The last piece of code generates a DOT graph file by visiting all graphNodes.
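The DOT generation step can be sketched like this; the helper and the node layout (numeric id, label, child ids) are hypothetical simplifications of graphNode:

```python
# Sketch: emitting a DOT digraph from collected nodes. The node shape,
# {id: (label, [child_ids])}, is an assumed simplification of graphNode.
def to_dot(nodes):
    lines = ["digraph LinkGraph {"]
    for nid, (label, children) in sorted(nodes.items()):
        lines.append('%d [label="%s" shape=square]' % (nid, label))
        for child in children:
            lines.append("%d -> %d" % (nid, child))
    lines.append("}")
    return "\n".join(lines)
```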
The complete source code (uxtr.js) is available at:
## install the "optimist" package
$ npm install optimist
## then, run it ("-n 2" means depth level 2):
$ node uxtr.js -s 127.0.0.1 -p 6789 -n 2 -t http://perma.cc/
Analyzing: /
Creating new node and linking
Creating new node and linking
Creating new node and linking
...
Hopping to level 0
Analyzing: /about
Creating new node and linking
Analyzing: /login
...
Creating new node and linking
Analyzing: /register
Analyzing: /privacy-policy
Analyzing: /copyright-policy
Generating dot script...
digraph LinkGraph {
0 [label="/" shape=square]
0 -> 1
0 -> 2
0 -> 3
0 -> 4
0 -> 5
...
1 [label="/static/css/bootstrap3.css" shape=square]
2 [label="/static/css/style-responsive.css" shape=square]
3 [label="/static/css/carousel.css" shape=square]
4 [label="/static/js/modernizr.js" shape=square]
...
67 [label="/static/css/bootstrap.css" shape=square]
68 [label="/static/css/bootstrap-responsive.css" shape=square]
69 [label="/static/css/style.css" shape=square]
70 [label="/static/js/bootstrap.js" shape=square]
}
For the sake of clarity, the generated DOT Graph file is available at:
Then, we use the Graphviz dot command to generate the final PNG image.
$ wget http://webarchivingbucket.com/uxtr/ex/perma.dot
$ dot -Tpng perma.dot > perma.png
And the final graph image (a ~7 MB PNG) is: