TRENS Tutorial

TRENS lets you convert between different character encodings in Erlang easily and safely.

1. Convert Shift JIS (Japanese) text to UTF-8

Suppose you want to convert Japanese text encoded in Shift JIS to UTF-8.
Here’s how to proceed with TRENS:
%% copy the following into a file called: "iconv.erl"
-module(iconv).

-export([to_utf8/1]).

to_utf8(InText) ->
   {ok, _Pid} = trens:start_link(),
   Port = trens:open(),

   %% set char encodings:   INPUT_ENC    OUTPUT_ENC
   ok = trens:setenc(Port, "shift-jis", "utf-8"),
   {ok, OutText} = trens:convert(Port, InText),

   ok = trens:close(Port),
   OutText.
Compile:
c:\> erlc iconv.erl
Then run:
c:\> werl
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> UTF8 = iconv:to_utf8("少なくとも1つの").
[..]
2> q().

Recap

  • Lines 7-8: trens:start_link/0 launches the linked-in driver and trens:open/0 creates a new TRENS instance (Port).
  • Line 11: trens:setenc/3 sets the input and output encodings for the conversion. The input encoding in the example is Shift JIS (Japanese).
  • Line 12: the trens:convert/2 call performs the character set conversion. On success, the
    {ok, Binary} tuple is returned. If some input is left unconverted, {more, Binary, Left} is returned, where Left is the remaining
    unconverted data. An exception is thrown in case of error.
  • Line 14: trens:close/1 frees all resources used by the TRENS instance.
  • Line 15: return the converted text.
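
The {more, Binary, Left} case can be handled with a small loop. The following is only a sketch: it assumes that the Left part can simply be passed to the converter again, which the recap above does not state explicitly. The module name convert_all and the function run/2 are hypothetical. The converter is passed in as a fun, so the loop itself does not depend on a running TRENS port.

```erlang
%% Sketch only: assumes that the Left part of a {more, Binary, Left}
%% result may be passed to the converter again. This is an assumption,
%% not documented TRENS behaviour.
-module(convert_all).

-export([run/2]).

%% ConvertFun is any fun(Data) -> {ok, Binary} | {more, Binary, Left},
%% for example: fun(Data) -> trens:convert(Port, Data) end.
run(ConvertFun, InData) ->
    run(ConvertFun, InData, []).

run(ConvertFun, InData, Acc) ->
    case ConvertFun(InData) of
        {ok, Out} ->
            %% Everything was converted: join the pieces in order.
            iolist_to_binary(lists:reverse([Out | Acc]));
        {more, _Out, Left} when Left =:= InData ->
            %% No progress was made: stop instead of looping forever.
            erlang:error({no_progress, Left});
        {more, Out, Left} ->
            %% Keep the converted part and retry with the remainder.
            run(ConvertFun, Left, [Out | Acc])
    end.
```

With an open TRENS instance, this could be called as convert_all:run(fun(Data) -> trens:convert(Port, Data) end, InText).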



2. Supported encodings

The list of all supported character encodings can be displayed with:
c:\> werl
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)

1> trens:list().
   437 CP437 IBM437 CSPC8CODEPAGE437
   850 CP850 IBM850 CSPC850MULTILINGUAL
   852 CP852 IBM852 CSPCP852
   [..]

2> q().
You can also print this list from the TRENS script:
c:\> trens_cmd.escript -l

or on Unix-like systems:

$ ./trens_cmd.escript -l



3. NULL terminator

Not all multibyte charsets can be null-terminated with a single null byte.
For example, UCS-2 needs 2 null bytes, and UCS-4 needs 4.
TRENS can append 4 null bytes after a successful conversion if desired.

Simply use the ‘eol’ atom in the call to trens:convert/3:

trens:convert(Port, <<"...">>, 'eol').

Note

By default, no extra null bytes are appended.
This is equivalent to passing ‘no_eol’ or to calling trens:convert/2.
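
To see why a single null byte cannot act as a terminator in such encodings, the sketch below encodes the letter "A" with the standard unicode module (not with TRENS). It uses UTF-16 and UTF-32, which produce the same bytes as UCS-2 and UCS-4 for this character. The module name nul_demo is hypothetical.

```erlang
%% Sketch using only the Erlang standard library (no TRENS calls).
-module(nul_demo).

-export([encode/2, has_zero_byte/1]).

%% Encode a string into the given encoding, e.g. utf8, {utf16, big}
%% or {utf32, big}.
encode(Text, Encoding) ->
    unicode:characters_to_binary(Text, unicode, Encoding).

%% True if the binary contains at least one zero byte.
has_zero_byte(Bin) ->
    binary:match(Bin, <<0>>) =/= nomatch.

%% nul_demo:encode("A", utf8)          returns <<65>>
%% nul_demo:encode("A", {utf16, big})  returns <<0,65>>
%% nul_demo:encode("A", {utf32, big})  returns <<0,0,0,65>>
```

The encoded letter itself already contains zero bytes in the 2-byte and 4-byte encodings, so a reader scanning for a single zero byte would stop in the middle of the text.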



4. Control buffer size

One of TRENS’ handiest features is that it lets you control the buffer size of the input data to convert.
Suppose you want to convert a 1GB text file. By default, TRENS splits the input into multiple chunks of 4KB each.
Thus, no more than 4KB of input is buffered for conversion at any given time.

The example below specifies 16KB per chunk:

trens:convert(Port, <<"...">>, 'no_eol', 16 * 1024).

Note

The maximum allowed chunk size is 10MB.
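
As a rough guide to choosing a chunk size, the sketch below computes how many chunks a given input is split into. It is plain arithmetic and does not call TRENS. The module name chunk_math is hypothetical, and the code assumes 1KB = 1024 bytes and 1GB = 1024 * 1024 * 1024 bytes.

```erlang
%% Sketch only: plain arithmetic, no TRENS calls.
-module(chunk_math).

-export([chunk_count/2]).

%% 10MB, the maximum chunk size stated in the note above.
-define(MAX_CHUNK, (10 * 1024 * 1024)).

%% Number of chunks needed for TotalBytes of input, rounding up so
%% that a final partial chunk is counted as well.
chunk_count(TotalBytes, ChunkSize)
  when is_integer(TotalBytes), TotalBytes >= 0,
       is_integer(ChunkSize), ChunkSize > 0, ChunkSize =< ?MAX_CHUNK ->
    (TotalBytes + ChunkSize - 1) div ChunkSize.

%% chunk_math:chunk_count(1024 * 1024 * 1024, 4096)       returns 262144
%% chunk_math:chunk_count(1024 * 1024 * 1024, 16 * 1024)  returns 65536
```

A larger chunk size means fewer chunks to process, at the cost of a larger buffer.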



5. Advanced features

Some of libiconv’s features are not well documented. These include the //IGNORE and //TRANSLIT extensions.
  1. IGNORE: discards any invalid sequences and attempts to continue the conversion.
  2. TRANSLIT: transliterates characters, that is, converts characters in the source encoding to the closest matching characters in the target encoding.
  3. BOTH: sets both ‘TRANSLIT’ and ‘IGNORE’, in this order. You may encounter situations where you need some characters transliterated and the others ignored. ‘BOTH’ helps in these cases.

Here are some examples:

%% IGNORE
trens:setenc(Port, "IN_ENCODING", "OUT_ENCODING", 'ignore').

%% TRANSLIT
trens:setenc(Port, "IN_ENCODING", "OUT_ENCODING", 'translit').

%% BOTH
trens:setenc(Port, "IN_ENCODING", "OUT_ENCODING", 'both').



6. TRENS’ command line

TRENS provides a native command line tool equivalent to the Unix command iconv.





c:\> trens_cmd.escript --help
A GNU 'iconv' clone.

Usage: trens_cmd <-f ENCODING> [-t ENCODING] [-o] [-e] [-m] [-s CHUNKSIZE] [INPUTFILE] [OUTPUTFILE]
or:    trens_cmd -l

Converts text from one encoding to another encoding.

The input text is considered as a stream, and is converted
chunk by chunk. This allows to convert huge files without
impacting RAM.

If no input file is specified, TRENS reads from "stdin".
If no output file is specified, TRENS writes to "stdout".

Options controlling the input and output format:
  -f ENCODING, --from-code=ENCODING
                              the encoding of the input
  -t ENCODING, --to-code=ENCODING
                              the encoding of the output (default: UTF-8)
  -o                          overwrite output file (default: no)
  -e                          appends a 4 null bytes after a successful
                              conversion (default: no)
  -m EXTENSION, --mode=EXTENSION
                              . 'ignore': simply discard any invalid sequences,
                                 and attempt to continue the conversion (equiv.
                                 to libiconv //IGNORE)
                              . 'translit': tells to transliterate characters,
                                 or convert characters in the origin encoding
                                 to the closest possible matching character in
                                 the target encoding (equiv. to libiconv //TRANSLIT)
                              . 'both': means 'translit' and 'ignore' are both set
                                 (equiv. to libiconv //TRANSLIT//IGNORE)
  -s CHUNKSIZE                chunk split size (default: 4096 bytes).
                              The minimum (resp. maximum) chunk's size is 4B (resp. 10MB).

Informative output:
  -l, --list                  list the supported encodings
  --help                      display this help and exit
  --version                   output version information and exit

Report bugs at: http://webarchivingbucket.com/tracker/

Copyright (C) 2010-2013 Aleph Archives, Inc.
See COPYRIGHT.pdf and 3rdLIC.pdf

When running from a shell, always make sure your locale is set correctly:

$ export LC_ALL="fr_FR.UTF-8"
$ echo -n "éléphant" | ./trens_cmd.escript -f "UTF-8" -t "ISO-8859-1" > out.txt