Copyright © 2010-2012 ALEPH ARCHIVES Ltd. All rights reserved.
Version: 1.0.0
Authors: Aleph Archives Ltd. [web site: http://aleph-archives.com/].
Primitives to handle WARC records.
This module allows you to read, parse, check, dump, create and alter WARC records in an intuitive manner.
The underlying WARC record format is abstracted in such a way that you never have to understand its internals.
http_eoh() = eoh
http_error() = {error, http_string()}
http_field() = http_field_atom() | http_string()
http_field_atom() = 'Cache-Control' | 'Connection' | 'Date' | 'Pragma' | 'Transfer-Encoding' | 'Upgrade' | 'Via' | 'Accept' | 'Accept-Charset' | 'Accept-Encoding' | 'Accept-Language' | 'Authorization' | 'From' | 'Host' | 'If-Modified-Since' | 'If-Match' | 'If-None-Match' | 'If-Range' | 'If-Unmodified-Since' | 'Max-Forwards' | 'Proxy-Authorization' | 'Range' | 'Referer' | 'User-Agent' | 'Age' | 'Location' | 'Proxy-Authenticate' | 'Public' | 'Retry-After' | 'Server' | 'Vary' | 'Warning' | 'Www-Authenticate' | 'Allow' | 'Content-Base' | 'Content-Encoding' | 'Content-Language' | 'Content-Length' | 'Content-Location' | 'Content-Md5' | 'Content-Range' | 'Content-Type' | 'Etag' | 'Expires' | 'Last-Modified' | 'Accept-Ranges' | 'Set-Cookie' | 'Set-Cookie2' | 'X-Forwarded-For' | 'Cookie' | 'Keep-Alive' | 'Proxy-Connection'
http_header() = {header, http_field(), http_version()}
http_method() = 'OPTIONS' | 'GET' | 'HEAD' | 'POST' | 'PUT' | 'DELETE' | 'TRACE' | http_string()
http_request() = {request, http_method(), http_uri(), http_version()}
http_response() = {response, http_version(), pos_integer(), http_string()}
http_string() = string() | binary()
http_uri() = '*' | {absoluteURI, http | https, http_string(), non_neg_integer() | undefined, http_string()} | {scheme, http_string(), http_string()} | {abs_path, http_string()} | http_string()
http_version() = {non_neg_integer(), non_neg_integer()}
property() = {write_slot_type(), value()}
proplist() = [property()]
read_slot_type() = vsn | type | recid | date | sub_slot_type()
sub_slot_type() = len | recid | ip | uri | mime | conc | bdig | pdig | rto | wid | wfn | wfile | prof | trunc | tyload | segnum | seglen | segorig
validation_error() = {error, found_forbidden_field, atom()} | {error, missing_mandatory_field, atom()} | {error, invalid_version, term()} | {error, invalid_content_type, term()} | {error, invalid_content_length, term()} | {error, invalid_date, term()} | {error, invalid_uri, term()} | {error, invalid_profile, term()} | {error, invalid_ip_address, term()} | {error, invalid_segment_number, term()} | {error, invalid_segment_total_length, term()} | {error, invalid_segment_origin_id, term()} | {error, invalid_record_id, term()} | {error, invalid_mime_type, term()} | {error, invalid_concurrent_to, term()} | {error, invalid_block_digest, term()} | {error, invalid_payload_digest, term()} | {error, invalid_refers_to, term()} | {error, invalid_info_id, term()} | {error, invalid_filename, term()} | {error, invalid_truncated, term()} | {error, invalid_identified_payload_type, term()} | {error, record_semantically_invalid, term()}
value() = calendar:datetime1970() | non_neg_integer() | binary() | inet:ip_address() | file:name() | function() | file | bytes | stream
write_slot_type() = data | source | read_slot_type()
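A minimal sketch of handling the result of is_valid/1, assuming each error arrives as an {error, Kind, Detail} tuple as listed in validation_error(). The module name validation_sketch and the helper describe/1 are hypothetical and not part of wsdk_record; the message wording is illustrative only.

```erlang
-module(validation_sketch).
-export([describe/1]).

%% Turn the result of wsdk_record:is_valid/1 into a printable message.
%% describe/1 is a hypothetical helper, not part of wsdk_record.
describe(ok) ->
    "record is valid";
describe({error, found_forbidden_field, Field}) ->
    lists:flatten(io_lib:format("forbidden field: ~p", [Field]));
describe({error, missing_mandatory_field, Field}) ->
    lists:flatten(io_lib:format("missing mandatory field: ~p", [Field]));
describe({error, Kind, Detail}) when is_atom(Kind) ->
    %% Covers the remaining invalid_* errors and record_semantically_invalid.
    lists:flatten(io_lib:format("~p: ~p", [Kind, Detail])).
```

A caller would typically write validation_sketch:describe(wsdk_record:is_valid(Record)) and log or display the returned string.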
clone_hdr/1 | Clones a WARC record's header block (no matter its type: read/write). |
get/2 | Getter to access the WARC record's internal state (i.e. fields). |
http_decode/2 | Parses a WARC record's HTTP payload ('request' or 'response') as a stream of data (extremely fast). |
is_valid/1 | Is the WARC record valid (syntactic and semantic validity) and compliant with the WARC v1.0 ISO 28500:2009 specifications? |
new/0 | Returns an empty WARC record for writing. |
new/1 | Returns a new WARC record for writing filled with data from proplist PropList. |
payload/1 | Retrieves the WARC record's payload chunk by chunk. |
payload/2 | Efficiently retrieves the WARC record's payload and dumps its content to file Filename on disk. |
set/3 | Setter to update the WARC record's internal state (i.e. fields). |
unset/2 | Resets the WARC record field to its default value. |
clone_hdr(Record::#wsdk_rrec{} | #wsdk_wrec{}) -> #wsdk_wrec{}
Clones a WARC record's header block (no matter its type: read/write). This call creates a carbon copy of the original record, for writing purposes.
wsdk_record:payload/1 and wsdk_record:payload/2.
get(Record::#wsdk_rrec{} | #wsdk_wrec{}, FieldName::write_slot_type() | soff) -> value()
Getter to access the WARC record's internal state (i.e. fields).
http_decode(Selector::status_line | headers, Bin::binary()) -> {ok, http_response() | http_request() | http_header() | http_eoh() | http_error(), binary()} | more | {error, term()}
Parses a WARC record's HTTP payload ('request' or 'response') as a stream of data (extremely fast).
- If an entire packet is contained in Bin, it is returned together with the remainder of the binary as {ok,Packet,Rest}.
- If Bin does not contain the entire packet, 'more' is returned. http_decode/2 can then be called again with more data added.
- If the packet does not conform to the HTTP protocol format, {error,Reason} is returned.
is_valid(Record::#wsdk_rrec{} | #wsdk_wrec{}) -> ok | validation_error()
Is the WARC record valid (syntactic and semantic validity) and compliant with the WARC v1.0 ISO 28500:2009 specifications?
This is a complex operation.
wsdk_record:http_decode/2.
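The return values documented for http_decode/2 can be handled with a small loop. The following is a sketch only: it assumes that the status_line selector decodes the first line of the HTTP payload and that the headers selector decodes one header (or the end-of-headers marker) per call. The module name http_decode_sketch and the helper parse_head/1 are hypothetical.

```erlang
-module(http_decode_sketch).
-export([parse_head/1]).

%% Decode the first line of an HTTP payload, then its headers, following
%% the return values documented for wsdk_record:http_decode/2.
%% parse_head/1 is a hypothetical helper, not part of wsdk_record.
parse_head(Bin) ->
    case wsdk_record:http_decode(status_line, Bin) of
        {ok, {error, Line}, _Rest} -> {error, {bad_start_line, Line}};
        {ok, StartLine, Rest}      -> parse_headers(Rest, StartLine, []);
        more                       -> more;
        {error, Reason}            -> {error, Reason}
    end.

%% Assumption: each call with 'headers' yields one header or 'eoh'.
parse_headers(Bin, StartLine, Acc) ->
    case wsdk_record:http_decode(headers, Bin) of
        {ok, eoh, Body} ->
            %% End of headers: whatever remains is the message body.
            {ok, StartLine, lists:reverse(Acc), Body};
        {ok, {error, Line}, _Rest} ->
            {error, {bad_header, Line}};
        {ok, Header, Rest} ->
            parse_headers(Rest, StartLine, [Header | Acc]);
        more ->
            more;
        {error, Reason} ->
            {error, Reason}
    end.
```

When parse_head/1 returns more, the caller is expected to append further payload data to the binary and call it again from the start; this keeps the sketch stateless at the cost of re-decoding the lines already seen.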
new() -> #wsdk_wrec{}
Returns an empty WARC record for writing.
new(PropList::proplist()) -> #wsdk_wrec{}
Returns a new WARC record for writing filled with data from proplist PropList.
payload(Record::#wsdk_rrec{}) -> {ok, binary(), #wsdk_rrec{}} | eof | incomplete
Retrieves the WARC record's payload chunk by chunk.
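Because payload/1 returns an updated record with every chunk, reading a whole payload means threading that record through repeated calls. A minimal sketch, assuming the payload fits in memory; the module name payload_sketch and the helper read_all/1 are hypothetical, and mapping incomplete to {error, incomplete} is a choice made for this example.

```erlang
-module(payload_sketch).
-export([read_all/1]).

%% Collect every payload chunk of a read record into a single binary.
%% read_all/1 is a hypothetical helper, not part of wsdk_record.
read_all(Record) ->
    read_all(Record, []).

read_all(Record, Acc) ->
    case wsdk_record:payload(Record) of
        {ok, Chunk, Record1} ->
            %% Continue with the updated record returned by payload/1.
            read_all(Record1, [Chunk | Acc]);
        eof ->
            {ok, iolist_to_binary(lists:reverse(Acc))};
        incomplete ->
            {error, incomplete}
    end.
```

For large payloads, writing to disk with payload/2 avoids holding the whole content in memory.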
payload(Record::#wsdk_rrec{}, Filename::file:name()) -> ok | incomplete
Efficiently retrieves the WARC record's payload and dumps its content to file Filename on disk.
This call ensures that all parent directories exist, trying to create them if necessary. If the output file Filename already exists, it will be erased first.
wsdk_record:payload/1 after wsdk_record:payload/2 (and vice-versa).
set(Record::#wsdk_wrec{}, FieldName::write_slot_type(), Value::value()) -> #wsdk_wrec{}
Setter to update the WARC record's internal state (i.e. fields).
unset(Record::#wsdk_wrec{}, FieldName::write_slot_type()) -> #wsdk_wrec{}
Resets the WARC record field to its default value.
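Since set/3 and unset/2 return a new record rather than modifying their argument, updates are chained by passing each result to the next call. The sketch below is illustrative only: it assumes that the uri and mime slots accept binaries, which the value() type permits but this reference does not spell out per slot. The module name record_sketch and the helpers build/2 and clear_mime/1 are hypothetical.

```erlang
-module(record_sketch).
-export([build/2, clear_mime/1]).

%% Build a write record carrying a target URI and a MIME type.
%% Assumption: both slots take binary values.
build(Uri, Mime) when is_binary(Uri), is_binary(Mime) ->
    R0 = wsdk_record:new(),
    R1 = wsdk_record:set(R0, uri, Uri),
    %% Each call returns an updated record; R0 and R1 are left unchanged.
    wsdk_record:set(R1, mime, Mime).

%% Put the mime slot back to its default value.
clear_mime(Record) ->
    wsdk_record:unset(Record, mime).
```

A field can then be read back with wsdk_record:get(Record, uri).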
Generated by EDoc, Sep 5 2012, 17:38:09.