%% Copy the following into a file called "iconv.erl".
-module(iconv).
-export([to_utf8/1]).

to_utf8(InText) ->
    {ok, _Pid} = trens:start_link(),
    Port = trens:open(),
    %% Set the character encodings: input encoding, then output encoding.
    ok = trens:setenc(Port, "shift-jis", "utf-8"),
    {ok, OutText} = trens:convert(Port, InText),
    ok = trens:close(Port),
    OutText.
c:\> erlc iconv.erl
c:\> werl
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1 (abort with ^G)
1> UTF8 = iconv:to_utf8("少なくとも1つの").
[..]
2> q().
c:\> werl
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.1 (abort with ^G)
1> trens:list().
437 CP437 IBM437 CSPC8CODEPAGE437
850 CP850 IBM850 CSPC850MULTILINGUAL
852 CP852 IBM852 CSPCP852
[..]
2> q().
c:\> trens_cmd.escript -l
or, on Unix-like systems:
$ ./trens_cmd.escript -l
Simply use the ‘eol’ atom in the call to trens:convert/3:
trens:convert(Port, <<"...">>, 'eol').
Note
The example below sets the chunk size to 16 KB:
trens:convert(Port, <<"...">>, 'no_eol', 16 * 1024).
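The trens_cmd help text further down states that the chunk size must lie between 4 bytes and 10 MB. A caller that computes the chunk size at run time may want to clamp it before passing it on. The following is a minimal sketch; the module name `chunk_size` is hypothetical, reading "10MB" as 10 * 1024 * 1024 bytes is an assumption, and so is the idea that trens:convert/4 enforces the same limits as the command-line tool.

```erlang
-module(chunk_size).
-export([clamp/1]).

%% Limits taken from the trens_cmd help text ("-s CHUNKSIZE").
-define(MIN_CHUNK, 4).
%% Assumption: "10MB" is read as 10 * 1024 * 1024 bytes.
-define(MAX_CHUNK, (10 * 1024 * 1024)).

%% Force a requested chunk size into the range [MIN_CHUNK, MAX_CHUNK].
clamp(Size) when is_integer(Size) ->
    max(?MIN_CHUNK, min(Size, ?MAX_CHUNK)).
```

The clamped value can then be used as the last argument, for example trens:convert(Port, Data, 'no_eol', chunk_size:clamp(Requested)).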
Note
Here are some examples:
%% IGNORE
trens:setenc(Port, "IN_ENCODING", "OUT_ENCODING", 'ignore').
%% TRANSLIT
trens:setenc(Port, "IN_ENCODING", "OUT_ENCODING", 'translit').
%% BOTH
trens:setenc(Port, "IN_ENCODING", "OUT_ENCODING", 'both').
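For readers coming from libiconv, the three mode atoms correspond to the suffixes that libiconv appends to the target encoding name, as listed in the trens_cmd help text. The lookup below only spells out that correspondence; the module and function names are hypothetical and are not part of TRENS.

```erlang
-module(trens_modes).
-export([libiconv_suffix/1]).

%% Map a setenc/4 mode atom to the libiconv suffix it is
%% described as equivalent to in the trens_cmd help text.
libiconv_suffix(ignore)   -> "//IGNORE";
libiconv_suffix(translit) -> "//TRANSLIT";
libiconv_suffix(both)     -> "//TRANSLIT//IGNORE".
```

For instance, a call using 'translit' plays the role of a target such as "ASCII//TRANSLIT" in iconv_open.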
c:\> trens_cmd.escript --help
A GNU 'iconv' clone.
Usage: trens_cmd <-f ENCODING> [-t ENCODING] [-o] [-e] [-m] [-s CHUNKSIZE] [INPUTFILE] [OUTPUTFILE]
   or: trens_cmd -l

Converts text from one encoding to another encoding.
The input text is considered as a stream, and is converted
chunk by chunk. This allows to convert huge files without
impacting RAM.

If no input file is specified, TRENS reads from "stdin".
If no output file is specified, TRENS writes to "stdout".

Options controlling the input and output format:
  -f ENCODING, --from-code=ENCODING
                  the encoding of the input
  -t ENCODING, --to-code=ENCODING
                  the encoding of the output (default: UTF-8)
  -o              overwrite output file (default: no)
  -e              appends a 4 null bytes after a successful
                  conversion (default: no)
  -m EXTENSION, --mode=EXTENSION
                  . 'ignore': simply discard any invalid sequences,
                    and attempt to continue the conversion (equiv.
                    to libiconv //IGNORE)
                  . 'translit': tells to transliterate characters,
                    or convert characters in the origin encoding
                    to the closest possible matching character in
                    the target encoding (equiv. to libiconv //TRANSLIT)
                  . 'both': means 'translit' and 'ignore' are both set
                    (equiv. to libiconv //TRANSLIT//IGNORE)
  -s CHUNKSIZE    chunk split size (default: 4096 bytes).
                  The minimum (resp. maximum) chunk's size is 4B (resp. 10MB).

Informative output:
  -l, --list      list the supported encodings
  --help          display this help and exit
  --version       output version information and exit
Report bugs at: http://webarchivingbucket.com/tracker/
Copyright (C) 2010-2013 Aleph Archives, Inc.
See COPYRIGHT.pdf and 3rdLIC.pdf
When running from a shell, always make sure your locale is set correctly:
$ export LC_ALL="fr_FR.UTF-8"
$ echo -n "éléphant" | ./trens_cmd.escript -f "UTF-8" -t "ISO-8859-1" > out.txt
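One way to sanity-check the result of the command above is to compare byte counts: "éléphant" occupies 10 bytes in UTF-8 (each "é" takes two bytes) but only 8 bytes in ISO-8859-1, so out.txt would be expected to be 8 bytes long. The sketch below computes both sizes using only the `unicode` module from Erlang/OTP; it does not call TRENS, and the module name `elephant_check` is hypothetical.

```erlang
-module(elephant_check).
-export([sizes/0]).

%% "éléphant" as Unicode code points, so the result does not depend
%% on the encoding of this source file.
word() ->
    [16#E9, $l, 16#E9, $p, $h, $a, $n, $t].

%% Returns {Utf8Bytes, Latin1Bytes} for the word above.
sizes() ->
    Utf8   = unicode:characters_to_binary(word(), unicode, utf8),
    Latin1 = unicode:characters_to_binary(word(), unicode, latin1),
    {byte_size(Utf8), byte_size(Latin1)}.
```

Calling elephant_check:sizes() returns {10, 8}. If out.txt has a different size, a likely cause is that the shell did not pass the input as UTF-8, which is what the locale setting controls.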