5. Protocol

UXTR offers a RESTful API for link extraction.
The underlying protocol is simple to understand, which makes it quick to implement.
Both HTTP and TCP are used to transport data (i.e., extracted links) to the client side.

5.1. Modes

5.1.1. HTTP Mode

In this mode, UXTR behaves like an HTTP web server. The server streams the response back using
HTTP chunked transfer encoding (Transfer-Encoding: chunked), allowing the client to process the extracted links as soon as they arrive.
This requires an HTTP/1.1 compliant client. Waiting until the full response has been received is strongly discouraged.
Processing the links in parallel, as they arrive, is the best approach to save memory and time.
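The streaming approach can be sketched in Python as follows. This is a minimal sketch, not part of UXTR: the helper name iter_links is hypothetical, and it assumes a line-oriented producer such as plain. It accepts any binary file-like object, so an in-memory buffer stands in for a live response here.

```python
import io
from typing import BinaryIO, Iterator


def iter_links(stream: BinaryIO) -> Iterator[str]:
    """Yield each non-empty line as soon as it has been read from the stream."""
    for raw_line in stream:  # reads one line at a time
        line = raw_line.rstrip(b"\r\n").decode("ascii", errors="replace")
        if line:
            yield line


# Stand-in for a live response so the sketch runs without a server.
fake_response = io.BytesIO(
    b"123,0,http://www.nasa.gov/a.js\r\n"
    b"123,1,http://www.nasa.gov/news/\r\n"
    b"123,2,\r\n"
)
received = list(iter_links(fake_response))
```

With a running server, the object returned by urllib.request.urlopen(request_url) could be passed in place of fake_response, since Python's HTTP client decodes chunked responses transparently and exposes them as a readable stream.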

5.1.2. TCP Mode

Here, UXTR sends data back to the client in raw mode.
It’s up to the client to parse the data (i.e., extracted links) following the format described below.

5.2. Request Format

To bind a UXTR extractor for a job, use the following HTTP GET request.
http(s)://HOST:PORT/?s=uxtr&o=json&u=a123&t=ANYURL
You MUST respect the order of the request parameters.
  • s: Service name. If stated, MUST be uxtr (optional field)
  • o: Producer type. Accepted values are: plain, json, or raw (optional, defaults to raw)
  • u: An identifier that is sent back to the client. Any combination of ASCII characters, at most 20 bytes (mandatory field)
  • t: A valid target URL to extract links from (mandatory field)
From the default configuration:
  • HOST: 127.0.0.1
  • PORT: 6789
The following requests are valid:
http://127.0.0.1:6789/?s=uxtr&o=plain&u=123&t=http://www.nasa.gov/
http://127.0.0.1:6789/?s=uxtr&o=json&u=0abc0&t=http://www.nasa.gov/
http://127.0.0.1:6789/?s=uxtr&o=raw&u=zyx&t=http://www.nasa.gov/
...
The following requests are equivalent:
http://127.0.0.1:6789/?s=uxtr&o=raw&u=123&t=http://edition.cnn.com/
http://127.0.0.1:6789/?o=raw&u=123&t=http://edition.cnn.com/
http://127.0.0.1:6789/?u=123&t=http://edition.cnn.com/
The following requests are invalid:
http://127.0.0.1:6789/?o=plain&s=uxtr&u=123&t=https://twitter.com/
http://127.0.0.1:6789/s=uxtr&u=123&t=https://twitter.com/&?o=plain
...
http://127.0.0.1:6789/?s=uxtR&o=json&u=123&t=https://twitter.com/
http://127.0.0.1:6789/?s=foo&o=plain&u=123&t=https://twitter.com/
http://127.0.0.1:6789/?s=uxtr&o=Raw&u=123&t=https://twitter.com/
http://127.0.0.1:6789/?o=jSon&u=123&t=https://twitter.com/
http://127.0.0.1:6789/?o=bar&u=123&t=https://twitter.com/
...
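The rules above can be captured in a small URL builder. This is a sketch under stated assumptions: the function name build_request_url and its keyword arguments are hypothetical, the target URL is appended verbatim because the examples above show it unencoded, and producer values are treated as case-sensitive because Raw and jSon are listed as invalid.

```python
from typing import Optional

PRODUCERS = ("plain", "json", "raw")


def build_request_url(uid: str, target: str, producer: Optional[str] = None,
                      service: bool = True, host: str = "127.0.0.1",
                      port: int = 6789, scheme: str = "http") -> str:
    """Assemble a request URL with parameters in the fixed order s, o, u, t."""
    if producer is not None and producer not in PRODUCERS:
        raise ValueError(f"unknown producer: {producer!r}")  # case-sensitive check
    if not uid or not uid.isascii() or len(uid) > 20:
        raise ValueError("uid must be 1 to 20 ASCII characters")
    if not target:
        raise ValueError("target URL is mandatory")
    params = []
    if service:
        params.append("s=uxtr")           # optional, but must be uxtr when present
    if producer is not None:
        params.append(f"o={producer}")    # optional, server defaults to raw
    params.append(f"u={uid}")
    params.append(f"t={target}")          # assumption: appended verbatim
    return f"{scheme}://{host}:{port}/?" + "&".join(params)
```

Building the parameter list in one place keeps the mandatory ordering out of the calling code, which only supplies values.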

5.3. Response Format

The web server supports three output modes: raw, plain, and json.
The first two are expressed in the UXTR Link Format specification (see below).
This format is optimized for speed and ease of parsing, while the last mode (JSON-based)
is intended for interoperability.
Below is the Augmented Backus-Naur Form (ABNF) grammar for UXTR’s link format.
CR            = <US-ASCII CR, carriage-return (13)>
LF            = <US-ASCII LF, linefeed (10)>
COMMA         = <US-ASCII comma separator (44)>
CRLF          = CR LF
url           = <URL per RFC1738> ; http://www.ietf.org/rfc/rfc1738.txt

UXTR-url      = url
UXTR-uid      = 1*20<US-ASCII character>  ; Unique Identifier or Tag
UXTR-kind     = 0 | 1 | 2
UXTR-busy     = 9

UXTR-links    = *UXTR-link | UXTR-overload
UXTR-link     = UXTR-uid COMMA UXTR-kind COMMA UXTR-url CRLF
UXTR-overload = UXTR-uid COMMA UXTR-busy COMMA UXTR-url CRLF
The fields of a link follow directly from the grammar above.
Only UXTR-kind needs clarification:
  • 0: the extracted link is an embedded resource (i.e., required to display the page correctly)
  • 1: the extracted link is an external one
  • 2: end of the link stream
Output examples for the plain and raw producers:
123,0,http://www.nasa.gov/foresee/foresee-trigger.js
123,0,http://search.usa.gov/javascripts/remote.loader.js
123,1,http://www.nasa.gov/news/media/info/index.html
...
123,2,
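A line in this format can be parsed with a single split. The sketch below is illustrative only: the names Link and parse_link are hypothetical, and it assumes the identifier contains no comma, since the grammar does not say how such a case would be escaped.

```python
from typing import NamedTuple


class Link(NamedTuple):
    uid: str
    kind: int
    url: str


def parse_link(line: str) -> Link:
    """Split one 'uid,kind,url' line into its three fields."""
    # maxsplit=2 keeps any commas inside the URL intact.
    uid, kind, url = line.rstrip("\r\n").split(",", 2)
    return Link(uid, int(kind), url)


first = parse_link("123,0,http://www.nasa.gov/foresee/foresee-trigger.js\r\n")
last = parse_link("123,2,\r\n")
```

The end-of-stream line parses like any other, with kind 2 and an empty URL.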
Output example for the json producer:
{
"links": [
    {
        "kind": 0,
        "url": "http://techcrunch.com/"
    },
    {
        "kind": 0,
        "url": "http://0.gravatar.com/js/gprofiles.js?ver=201345ae"
    },
    {
        "kind": 0,
        "url": "http://s2.wp.com/wp-content/mu-plugins/gravatar-hovercards/wpgroho.js?m=1380573781g"
    },
    {
        "kind": 0,
        "url": "http://s1.wp.com/wp-content/js/devicepx.js?m=1373391538g"
    },
    {
        "kind": 1,
        "url": "http://privacy.aol.com/"
    },
    {
        "kind": 1,
        "url": "http://techcrunch.com/anti-harassment-policy/"
    },
    {
        "kind": 0,
        "url": "http://platform.twitter.com/widgets.js?ver=20111117"
    },
    {
        "kind": 2,
        "url": ""
    }
],
"target": "http://techcrunch.com/",
"uid": "xyz"
}
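A client consuming the json producer will typically separate embedded resources from external links and ignore the end marker. The following sketch assumes the field names shown in the example above; the function name split_links and the group labels are hypothetical.

```python
import json
from typing import Dict, List

KIND_EMBED, KIND_EXTERNAL, KIND_END = 0, 1, 2


def split_links(body: str) -> Dict[str, List[str]]:
    """Group the URLs of a json-producer response by kind, dropping the end marker."""
    doc = json.loads(body)
    groups: Dict[str, List[str]] = {"embedded": [], "external": []}
    for link in doc["links"]:
        if link["kind"] == KIND_EMBED:
            groups["embedded"].append(link["url"])
        elif link["kind"] == KIND_EXTERNAL:
            groups["external"].append(link["url"])
        # KIND_END carries an empty URL and is skipped.
    return groups


sample = """{"links": [{"kind": 0, "url": "http://techcrunch.com/"},
                       {"kind": 1, "url": "http://privacy.aol.com/"},
                       {"kind": 2, "url": ""}],
             "target": "http://techcrunch.com/", "uid": "xyz"}"""
result = split_links(sample)
```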

5.4. Termination & Errors

When the plain or the json producer is in use, the HTTP client must support HTTP chunked transfer encoding (Transfer-Encoding: chunked).
This means that if the last chunk (0\r\n) isn’t received, the caller should decide whether to discard the current result set or to try again.
With the raw producer, the caller must receive a special link of the following kind, which indicates the end of the link stream:
UXTR-uid,2,
As you can see, there’s no URL after the last comma (only an invisible CRLF).
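For the raw producer, data may arrive split at arbitrary byte boundaries, so a client has to buffer until a full CRLF-terminated line is available and then check for the kind 2 marker. Below is a minimal sketch of such a parser; the class name RawStreamParser and its attributes are hypothetical, and it assumes identifiers contain no comma.

```python
from typing import List, Tuple


class RawStreamParser:
    """Accumulates raw bytes and extracts complete CRLF-terminated link lines."""

    def __init__(self) -> None:
        self._buffer = b""
        self.links: List[Tuple[str, int, str]] = []
        self.complete = False  # becomes True once the kind-2 line has been seen

    def feed(self, data: bytes) -> None:
        self._buffer += data
        while not self.complete and b"\r\n" in self._buffer:
            line, self._buffer = self._buffer.split(b"\r\n", 1)
            uid, kind, url = line.decode("ascii").split(",", 2)
            if int(kind) == 2:
                self.complete = True
            else:
                self.links.append((uid, int(kind), url))


parser = RawStreamParser()
# Pieces deliberately cut in the middle of a CRLF and in the middle of a URL.
for piece in (b"123,0,http://a.example/x.js\r",
              b"\n123,1,http://b.exa",
              b"mple/\r\n123,2,\r\n"):
    parser.feed(piece)
```

If the connection closes while parser.complete is still False, the result set is incomplete and the caller can discard it or retry.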

5.5. Handling Overloads

In certain circumstances, a UXTR link extraction attempt may fail because the setup is undersized for the load.
Imagine a UXTR instance with 3 running engines. If you simultaneously try to extract links from 1000 URLs,
most of them will time out for lack of resources (see the engines configuration option).
In these particular cases, UXTR warns you by returning a special message (code 9).
For the plain and raw producers:
UXTR-uid,9,UXTR-url
For the json producer:
{
 "links": [],
 "status": 9,
 "target": "UXTR-url",
 "uid": "UXTR-uid"
}
It is up to the caller to try again or simply give up.
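One possible retry policy is exponential backoff. The sketch below is only an illustration, not something UXTR prescribes: the function name extract_with_retry, the delays, and the attempt limit are assumptions, and fetch stands for any caller-supplied function that performs one extraction and returns parsed (uid, kind, url) records.

```python
import time
from typing import Callable, List, Tuple

Record = Tuple[str, int, str]  # (uid, kind, url)
BUSY = 9


def extract_with_retry(fetch: Callable[[], List[Record]], attempts: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> List[Record]:
    """Call fetch() until it returns something other than an overload record."""
    for attempt in range(attempts):
        records = fetch()
        overloaded = any(kind == BUSY for _, kind, _ in records)
        if not overloaded:
            return records
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # delay doubles after each failed try
    raise RuntimeError("UXTR still overloaded after %d attempts" % attempts)
```

Passing sleep as a parameter lets the policy be exercised without real waiting, and giving up is expressed as an exception the caller can handle.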