6. Configuration

UXTR’s main configuration file is called app.config.
Its location depends on the platform:
# Windows
c:\uxtr-1.0\etc\app.config

# Mac OSX
/Applications/uxtr/etc/app.config

# Linux
$HOME/.uxtr/etc/app.config
Options are explained below:
[{included_applications,[]},
 %% UXTR's config
 {uxtr, [
         %% network settings for REST API
         {rest_api, true},
         {rest_ip_addr, {0,0,0,0}}, %% or "::" for IPv6
         {rest_port, 6789}, %% > 1024
         {rest_protocol, tcp}, %% tcp | ssl

         %% network settings for administration (internal)
         {admin_api, true},
         {ip_addr, {127,0,0,1}},
         {port, 2345}, %% > 1024

         %% number of extractors at boot
         {engines, auto}, %% no | auto | twice | 1 .. 50

         %% keep alive conns.
         {keepalive, true}, %% boolean()
         {keepalive_max, 99}, %% 1 .. 1000
         {keepalive_tmout, 3000}, %% 3sec. timeout (hard-coded)

         %% how many "binds" before giving-up?
         {binds, 30}, %% 0 .. 100

         %% how many "retries" before giving-up?
         {retries, 1}, %% 0 .. 10

         %% use proxy to access the net.
         {proxy, false},
         {proxy_addr, {127,0,0,1}},
         {proxy_port, 3128}, %% ex. Squid proxy

         %% parallel pages per bot?
         {pages, 1}, %% 1 .. 3

         %% page timeout in sec.
         {page_timeout, 45}, %% 30 ..

         %% max. number of extractions before renewal
         {renewal, 33}, %% 2 .. 1000

         %% TCP heart beat mode
         {hb, false},        %% activate 'heart beat' or not?
         {hb_interval, 30},  %% 30 sec.
         {hb_tmout, 3},      %% 3 sec.

         %% Valeo heart beat mode
         {hb2, true},

         %% caching on/off
         {disable_cache, true} %% boolean()
        ]},

 %% Lager's config (for logs)
 {lager,
       [
        {error_logger_hwm, 30},
        {async_threshold, 20},
        {async_threshold_window, 5},
        {handlers,
         [
          {lager_console_backend, [debug, true]},
          {lager_file_backend, [{file, "var/log/uxtr/error.log"}, {level, error}, {size, 10485760}, {date, "$D0"}, {count, 5}]},
          {lager_file_backend, [{file, "var/log/uxtr/info.log"},  {level, info},  {size, 10485760}, {date, "$D0"}, {count, 5}]},
          {lager_file_backend, [{file, "var/log/uxtr/debug.log"}, {level, debug}, {size, 10485760}, {date, "$D0"}, {count, 5}]}
         ]},
        {error_logger_redirect, true},
        {crash_log, "var/log/uxtr/crash.log"},
        {crash_log_msg_size, 65536},
        {crash_log_size, 10485760},
        {crash_log_date, "$D0"},
        {crash_log_count, 5}
       ]}
].
The default settings suit most general-purpose deployments,
but you can adapt them to fit your needs.

6.1. Architecture

The following picture depicts the architecture of a live UXTR node, running on a 4-core CPU machine with default settings.
A target URL, http://techcrunch.com/, is being processed for link extraction by one of the 4 engines (process id <0.174.0>).
[Figure: UXTR Architecture]

6.2. Settings

6.2.1. Ports

If the default port numbers are already in use on your system, change them in app.config. As noted in the file’s comments, both ports must be greater than 1024.
  • UXTR port number for the RESTful API is 6789 (key rest_port)
  • UXTR port number for the ADMIN API is 2345 (key port)

IPv6 addresses are natively supported (for example, "::" for rest_ip_addr).
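
As a minimal sketch of the change described above, the excerpt below moves both APIs to other ports. The values 8789 and 8345 are arbitrary examples, not recommendations; any free port above 1024 should do.

```erlang
%% Excerpt of the uxtr section in app.config
{uxtr, [
        %% ... other keys unchanged ...
        {rest_port, 8789},  %% REST API port (default: 6789), example value
        {port, 8345}        %% ADMIN API port (default: 2345), example value
        %% ... other keys unchanged ...
       ]}
```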

6.2.2. Listening Interface

The key rest_ip_addr restricts access to the REST API to a given network interface.
By default, it’s set to {0,0,0,0}, which listens on all interfaces and accepts connections from any IP address.
To restrict connections to local processes only, use {127,0,0,1}.
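
For illustration, a local-only setup could look like the excerpt below. It only changes rest_ip_addr; the ADMIN API address (key ip_addr) already defaults to {127,0,0,1}.

```erlang
%% Excerpt of the uxtr section in app.config
{uxtr, [
        %% ... other keys unchanged ...
        {rest_api, true},
        {rest_ip_addr, {127,0,0,1}}  %% REST API reachable from this machine only
        %% ... other keys unchanged ...
       ]}
```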

6.2.3. Engines

The key engines controls how many extractors are started at boot time.
  • no: no instance is started
  • auto: N instances are started, where N equals the number of CPU cores of your machine
  • twice: N instances are started, where N equals twice the number of CPU cores of your machine
  • N: N instances are started, where 1 <= N <= 50

Each UXTR instance can manage more than one extractor at a time (up to 50).
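
The excerpt below sketches one of the values listed above. It assumes the 4-core machine used in the Architecture section, so the resulting count is only an example.

```erlang
%% Excerpt of the uxtr section in app.config
{uxtr, [
        %% ... other keys unchanged ...
        %% twice: two extractors per CPU core,
        %% i.e. 8 extractors on a 4-core machine
        {engines, twice}
        %% Alternatives: no | auto | an integer from 1 to 50,
        %% e.g. {engines, 12} for a fixed number of extractors
        %% ... other keys unchanged ...
       ]}
```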

6.2.4. Retries After Failure

The key retries controls how many times an extractor retries a failed extraction before giving up.
Suppose retries equals 2. The extractor then retries up to 2 times after a failed extraction (3 attempts in total), then gives up.
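
The example above corresponds to the following excerpt. The accepted range (0 to 10) is taken from the comment in the default file.

```erlang
%% Excerpt of the uxtr section in app.config
{uxtr, [
        %% ... other keys unchanged ...
        {retries, 2}  %% retry a failed extraction twice, then give up (0 .. 10)
        %% ... other keys unchanged ...
       ]}
```

Setting retries to 0 should disable retrying altogether, assuming the lower bound of the range behaves as its name suggests.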

6.2.5. Proxy Support

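The key proxy controls whether UXTR uses a proxy to access the network.
When it is enabled, the keys proxy_addr and proxy_port give the proxy’s address and port; the defaults point to a local proxy on port 3128 (for example, Squid).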
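
A minimal sketch of a proxy-enabled setup, reusing the address and port from the default file (a Squid proxy listening locally is assumed here):

```erlang
%% Excerpt of the uxtr section in app.config
{uxtr, [
        %% ... other keys unchanged ...
        {proxy, true},                %% route outgoing requests through the proxy
        {proxy_addr, {127,0,0,1}},    %% proxy address (here: the local machine)
        {proxy_port, 3128}            %% proxy port (Squid's usual port)
        %% ... other keys unchanged ...
       ]}
```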

6.2.6. Page Load Time

Some websites are fast, some are slow. The key page_timeout sets how long, in seconds, UXTR waits for a webpage to load (45 by default, 30 at minimum).
If the timeout is reached, UXTR forcibly stops loading the page and proceeds to the analysis.
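
For slow target sites, the timeout can be raised, as sketched below. The value 90 is an arbitrary example; the default file only states a lower bound of 30 seconds.

```erlang
%% Excerpt of the uxtr section in app.config
{uxtr, [
        %% ... other keys unchanged ...
        {page_timeout, 90}  %% wait up to 90 sec. per page (default: 45, minimum: 30)
        %% ... other keys unchanged ...
       ]}
```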

6.2.7. Logging

UXTR uses Lager, a cross-platform logging framework for Erlang. It performs
application logging with support for multiple backends, log rotation, syslog, etc.
The lager section of app.config configures it. All logs reside under:
# Windows
c:\uxtr-1.0\var\log\uxtr\

# Mac OSX
/Applications/uxtr/var/log/uxtr/

# Linux
$HOME/.uxtr/var/log/uxtr/
Please refer to lager’s documentation and adapt the parameters to your needs.
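
As a hedged illustration of such an adaptation, the excerpt below makes the console less verbose and keeps more rotated error logs. It reuses only the handler options that already appear in the default file; the chosen values (info, 10) are examples, and the exact option syntax should be checked against the Lager version shipped with UXTR.

```erlang
%% Excerpt of the lager section in app.config
{lager,
      [
       %% ... other keys unchanged ...
       {handlers,
        [
         %% console: level raised from debug to info (example value)
         {lager_console_backend, [info, true]},
         %% error.log: keep 10 rotated files instead of 5 (example value)
         {lager_file_backend, [{file, "var/log/uxtr/error.log"}, {level, error}, {size, 10485760}, {date, "$D0"}, {count, 10}]}
         %% info.log and debug.log handlers omitted here for brevity;
         %% keep them in the real file if those logs are needed
        ]}
       %% ... other keys unchanged ...
      ]}
```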

6.2.8. Other Options

All remaining options should be left with their default values, unless you REALLY know what you’re doing.

6.3. Tuning

6.3.1. Open File Limits

UXTR can consume a large number of open file handles during normal operation. In particular,
the UXTR TCP Handler backend may accumulate a number of TCP connections when the keep-alive
directive is in use.
  1. Linux: on most Linux distributions, the system-wide limit for open files is controlled by sysctl (fs.file-max), while per-user limits are set in /etc/security/limits.conf.

Edit /etc/security/limits.conf and append the following lines to the file:

*               soft     nofile          200000
*               hard     nofile          200000

If you access UXTR via secure shell (ssh), also edit /etc/ssh/sshd_config to read as follows (recent OpenSSH releases no longer support these two options, so this step applies to older releases only):

UseLogin yes
UsePrivilegeSeparation no

Restart the machine so that the limits take effect, then verify the new limit with the following command, which should report 200000:

$ ulimit -n
  2. Mac OSX: to check the current limits on your Mac OSX system, run:
$ launchctl limit maxfiles
maxfiles    8192     16384

The last two columns are the soft and hard limits, respectively.

To adjust the maximum open file limits in OSX 10.7 (Lion) or newer, edit /etc/launchd.conf and increase
both values as appropriate.
For example, to set the soft limit to 32768 files and the hard limit to 128000 files,
edit (or create) /etc/launchd.conf so that it contains the following line:
$ grep maxfiles /etc/launchd.conf
limit maxfiles 32768 128000
Save the file, and restart the system for the new limits to take effect.
After restarting, verify the new limits with the launchctl limit command:
$ launchctl limit
 cpu         unlimited      unlimited
 filesize    unlimited      unlimited
 data        unlimited      unlimited
 stack       8388608        67104768
 core        0              unlimited
 rss         unlimited      unlimited
 memlock     unlimited      unlimited
 maxproc     1064           1064
 maxfiles    32768          128000

6.3.2. Kernel & Network Tuning

The following settings are minimally sufficient to improve many aspects of UXTR usage on Linux,
and should be added or updated in /etc/sysctl.conf:
net.ipv4.conf.all.rp_filter               = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.default.rp_filter           = 1
net.ipv4.ip_forward                       = 0
net.ipv4.ip_local_port_range              = 1024 65535
net.ipv4.tcp_congestion_control           = bic
net.ipv4.tcp_dsack                        = 0
net.ipv4.tcp_ecn                          = 0
net.ipv4.tcp_fack                         = 0
net.ipv4.tcp_fin_timeout                  = 15
net.ipv4.tcp_keepalive_intvl              = 30
net.ipv4.tcp_keepalive_probes             = 3
net.ipv4.tcp_keepalive_time               = 120
net.ipv4.tcp_low_latency                  = 1
net.ipv4.tcp_max_syn_backlog              = 40000
net.ipv4.tcp_max_tw_buckets               = 8000000
net.ipv4.tcp_mem                          = 30000000 30000000 30000000
net.ipv4.tcp_moderate_rcvbuf              = 1
net.ipv4.tcp_no_metrics_save              = 1
net.ipv4.tcp_orphan_retries               = 0
net.ipv4.tcp_retries1                     = 3
net.ipv4.tcp_retries2                     = 15
net.ipv4.tcp_rmem                         = 30000000 30000000 30000000
net.ipv4.tcp_sack                         = 1
net.ipv4.tcp_slow_start_after_idle        = 0
net.ipv4.tcp_syn_retries                  = 1
net.ipv4.tcp_synack_retries               = 1
net.ipv4.tcp_syncookies                   = 0
net.ipv4.tcp_timestamps                   = 0
net.ipv4.tcp_tw_recycle                   = 1
net.ipv4.tcp_tw_reuse                     = 1
net.ipv4.tcp_window_scaling               = 1
net.ipv4.tcp_wmem                         = 30000000 30000000 30000000

net.core.netdev_max_backlog               = 400000
net.core.optmem_max                       = 16777216
net.core.rmem_default                     = 16777216
net.core.rmem_max                         = 16777216
net.core.somaxconn                        = 102400
net.core.wmem_default                     = 16777216
net.core.wmem_max                         = 16777216

kernel.core_uses_pid                      = 1
kernel.shmmax                             = 67108864
kernel.sysrq                              = 0

fs.file-max                               = 200000
#vm.swappiness                            = 0

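Save the file, and restart the system (or run sysctl -p as root) for the new settings to take effect.
On Linux 4.12 and later, omit net.ipv4.tcp_tw_recycle, which was removed from the kernel.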