Email

Supported formats

RFC 5322

Description

This backend offers a high-performance, adaptable email parser designed for efficient data and metadata extraction. It prioritizes parsing methods that align closely with how modern Mail User Agents handle email content, rather than strictly following official specifications. This approach allows for more realistic and comprehensive analysis of email data, capturing details that are often missed by traditional parsers focused solely on technical standards.

info

Available in Contextal Platform 1.0 and later.

Features

  • Header decoding: minimal parsing, generic validation
  • RFC 2047 and RFC 2184 header character set decoding
  • Body decoding (identity, quoted-printable, base64)
  • MIME multipart support (each concrete part becomes a child object)
  • Charset-aware text conversion to UTF-8 (with replacement) of all inline part bodies
  • Extensive extraction of metadata, anomalies, and flaws
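
As a rough illustration of what the decoding features above involve, the following sketch uses Python's standard `email` package. It is not the backend's implementation (the backend follows Mail User Agent behavior rather than the specifications), and the sample message and the helper names `to_text`, `decode_rfc2047` and `extract_parts` are made up for this example.

```python
from email import message_from_bytes
from email.header import decode_header

# Hypothetical sample message, made up for this sketch.
RAW = (
    b"From: =?iso-8859-1?Q?J=F6rg?= <joerg@example.com>\r\n"
    b"Subject: =?utf-8?B?SGVsbG8gd29ybGQ=?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=FC=DFe\r\n"
    b"--b1\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b'Content-Disposition: attachment; filename="a.bin"\r\n'
    b"\r\n"
    b"AAEC\r\n"
    b"--b1--\r\n"
)


def to_text(data, charset):
    """Charset-aware conversion; undecodable bytes become U+FFFD."""
    try:
        return data.decode(charset or "ascii", errors="replace")
    except LookupError:  # unknown charset label
        return data.decode("ascii", errors="replace")


def decode_rfc2047(value):
    """Decode RFC 2047 encoded words in a header value."""
    chunks = []
    for data, charset in decode_header(value):
        chunks.append(to_text(data, charset) if isinstance(data, bytes) else data)
    return "".join(chunks)


def extract_parts(raw):
    """Return (content_type, is_attachment, body) for each concrete part."""
    msg = message_from_bytes(raw)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers are not concrete parts
        # decode=True undoes quoted-printable / base64 transfer encoding
        body = part.get_payload(decode=True) or b""
        is_attachment = part.get_content_disposition() == "attachment"
        if not is_attachment and part.get_content_maintype() == "text":
            body = to_text(body, part.get_content_charset())
        parts.append((part.get_content_type(), is_attachment, body))
    return parts


if __name__ == "__main__":
    message = message_from_bytes(RAW)
    print(decode_rfc2047(message["Subject"]))
    for content_type, is_attachment, body in extract_parts(RAW):
        print(content_type, is_attachment, body)
```

In this sketch, inline text bodies come back as Unicode strings, while attachments stay as raw bytes.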

Symbols

Object

  • DUP_BCC → duplicate Bcc header
  • DUP_CC → duplicate Cc header
  • DUP_ENVELOPE_TO → duplicate Envelope-To header
  • DUP_FROM → duplicate From header
  • DUP_IN_REPLY_TO → duplicate In-Reply-To header
  • DUP_MESSAGE_ID → duplicate Message-ID header
  • DUP_REPLY_TO → duplicate Reply-To header
  • DUP_RETURN_PATH → duplicate Return-Path header
  • DUP_SUBJECT → duplicate Subject header
  • DUP_TO → duplicate To header
  • FROM_LIST → the mail is apparently from a mailing list
  • INVALID_DATE → the Date header is syntactically invalid
  • INVALID_HEADERS → one or more of the headers are syntactically invalid
  • INVALID_MIME_VER → the MIME-Version header is syntactically invalid or reports an invalid version
  • LIMITS_REACHED → limits triggered while processing the message
  • MISSING_DATE → the mandatory Date header is not present
  • MISSING_FROM → the mandatory From header is not present
  • MISSING_MESSAGE_ID → the mandatory Message-ID header is not present
  • MISSING_MIME_VER → the MIME-Version header is not present
  • MISSING_SUBJECT → the mandatory Subject header is not present
  • MISSING_TO → the To header is not present
  • RESENT → the message appears to have been resent
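
To make the DUP_* and MISSING_* symbols above more concrete, here is a minimal sketch of how such checks could be reproduced with Python's standard `email` package. The symbol names come from the list above; the detection logic, the sample message and the function name `header_symbols` are assumptions for illustration and may differ from what the backend actually does.

```python
from email import message_from_bytes

# Headers for which the list above defines a DUP_* symbol.
DUP_CHECKED = [
    "Bcc", "Cc", "Envelope-To", "From", "In-Reply-To",
    "Message-ID", "Reply-To", "Return-Path", "Subject", "To",
]

# Headers for which the list above defines a MISSING_* symbol.
MISSING_CHECKED = {
    "Date": "MISSING_DATE",
    "From": "MISSING_FROM",
    "Message-ID": "MISSING_MESSAGE_ID",
    "MIME-Version": "MISSING_MIME_VER",
    "Subject": "MISSING_SUBJECT",
    "To": "MISSING_TO",
}

# Hypothetical sample: Subject appears twice, Date and Message-ID are absent.
SAMPLE = (
    b"From: a@example.com\r\n"
    b"To: b@example.com\r\n"
    b"Subject: first\r\n"
    b"Subject: second\r\n"
    b"MIME-Version: 1.0\r\n"
    b"\r\n"
    b"body\r\n"
)


def header_symbols(raw):
    """Return a sorted list of DUP_* / MISSING_* symbols for a raw message."""
    msg = message_from_bytes(raw)
    symbols = set()
    for name in DUP_CHECKED:
        # get_all() matches header names case-insensitively
        if len(msg.get_all(name, [])) > 1:
            symbols.add("DUP_" + name.upper().replace("-", "_"))
    for name, symbol in MISSING_CHECKED.items():
        if msg.get(name) is None:
            symbols.add(symbol)
    return sorted(symbols)


if __name__ == "__main__":
    print(header_symbols(SAMPLE))
```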

Children

  • CHARSET_ATTM → one or more attachments (i.e. non-inline parts) carry a charset (some malware does this)
  • INVALID_BODY_ENC → this child is a message body and contains flaws in the way it is encoded
  • INVALID_HEADERS → one or more of the part headers are syntactically invalid
  • TOOBIG → this part was not extracted as it exceeds the limits
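
The CHARSET_ATTM condition above can be approximated as follows. This is only a sketch: it treats a part as an attachment when its Content-Disposition is `attachment`, which is an assumption about what "non-inline" means, and `build_sample` and `has_charset_attachment` are hypothetical names.

```python
from email import message_from_bytes


def build_sample(attachment_type):
    """Build a hypothetical two-part message with the given attachment type."""
    return (
        b"MIME-Version: 1.0\r\n"
        b'Content-Type: multipart/mixed; boundary="b1"\r\n'
        b"\r\n"
        b"--b1\r\n"
        b"Content-Type: text/plain; charset=utf-8\r\n"
        b"\r\n"
        b"hello\r\n"
        b"--b1\r\n"
        b"Content-Type: " + attachment_type.encode("ascii") + b"\r\n"
        b'Content-Disposition: attachment; filename="a.pdf"\r\n'
        b"\r\n"
        b"data\r\n"
        b"--b1--\r\n"
    )


def has_charset_attachment(raw):
    """True if any attachment part declares a charset parameter."""
    msg = message_from_bytes(raw)
    for part in msg.walk():
        if part.is_multipart():
            continue
        is_attachment = part.get_content_disposition() == "attachment"
        # get_content_charset() returns None when no charset is declared
        if is_attachment and part.get_content_charset() is not None:
            return True
    return False


if __name__ == "__main__":
    print(has_charset_attachment(build_sample("application/pdf; charset=utf-8")))
    print(has_charset_attachment(build_sample("application/pdf")))
```

The inline text part in the sample also declares a charset, but it is not an attachment and so is not counted.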

Example Metadata

{
  "org": "ctx",
  "object_id": "a7e70dafb3bbc49ff7e284d084ea80e7a687903712c30d54388cfb986062550d",
  "object_type": "Email",
  "object_subtype": null,
  "recursion_level": 1,
  "size": 11301,
  "hashes": {
    "sha256": "a7e70dafb3bbc49ff7e284d084ea80e7a687903712c30d54388cfb986062550d",
    "sha1": "1dff07e2f13f7b2171042d924b3a6b647f04671c",
    "sha512": "b8de8e5000bbf5d335cbdb37751c5f1ecfd041383d2501c3136e54f5c90e0cfa19ae3f7d595da0c2c7326976f150e3ab72bc03f834b1af35960f8f8aa5b215f7",
    "md5": "bcb6d5b532df71d89154946206449487"
  },
  "ctime": 1725899412.988861,
  "ok": {
    "symbols": [],
    "object_metadata": {
      "_backend_version": "1.0.0",
      "charset": "iso-8859-1",
      "date_ts": 1598959647,
      "has_html_body": true,
      "has_text_body": false,
      "hdrs_health": {
        "bad_name": false,
        "bad_value": false,
        "bad_value_encoding": false,
        "bad_value_params": false,
        "bad_value_quoting": false
      },
      "headers": [
        {
          "dup": false,
          "name": "from",
          "value": "e-mail server bl****ware.com < tc***n@bl****ware.com >"
        },
        {
          "dup": false,
          "name": "message-id",
          "value": "<20200901042726.3b5c769a539eaed6@bl****ware.com>"
        },
        {
          "dup": false,
          "name": "reply-to",
          "value": "info@ph****api.live"
        },
        {
          "dup": false,
          "name": "return-path",
          "value": "<tc***n@bl****ware.com>"
        },
        {
          "dup": false,
          "name": "subject",
          "value": "email suspension warning for tc***n@bl****ware.com"
        },
        {
          "dup": false,
          "name": "to",
          "value": "tc***n@bl****ware.com"
        }
      ],
      "mime_type": "text/html",
      "multipart": false,
      "n_attachments": 0
    },
    "children": [
      {
        "org": "ctx",
        "object_id": "b93e1ad4e327fc4ced0d76d8db5c3c170cd50ca74585131b0c99e88b08e4326b",
        "object_type": "HTML",
        "object_subtype": null,
        "recursion_level": 2,
        "size": 8879,
        [...]
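
The `headers` list in the metadata above is straightforward to post-process. The sketch below indexes it by header name and checks whether the Reply-To domain differs from the From domain, as it does in this example. The dictionary is a trimmed copy of the `object_metadata` shown above; the helper names and the domain-matching regular expression are assumptions made for illustration and are not part of the platform.

```python
import re

# Trimmed copy of the "object_metadata" from the example above.
OBJECT_METADATA = {
    "headers": [
        {"dup": False, "name": "from",
         "value": "e-mail server bl****ware.com < tc***n@bl****ware.com >"},
        {"dup": False, "name": "reply-to", "value": "info@ph****api.live"},
        {"dup": False, "name": "subject",
         "value": "email suspension warning for tc***n@bl****ware.com"},
    ],
}


def headers_by_name(meta):
    """Map each header name to the list of its values, in order."""
    index = {}
    for header in meta.get("headers", []):
        index.setdefault(header["name"], []).append(header["value"])
    return index


def address_domain(value):
    """Return the domain of the last address-like token, or None."""
    domains = re.findall(r"@([^\s<>]+)", value)
    return domains[-1].lower() if domains else None


def reply_to_differs(meta):
    """True if both From and Reply-To exist and their domains differ."""
    headers = headers_by_name(meta)
    if "from" not in headers or "reply-to" not in headers:
        return False
    from_domain = address_domain(headers["from"][0])
    reply_domain = address_domain(headers["reply-to"][0])
    return from_domain != reply_domain


if __name__ == "__main__":
    print(reply_to_differs(OBJECT_METADATA))
```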

Example Queries

object_type == "Email"
&& @has_symbol("INVALID_HEADERS")
  • This query matches an Email object with syntactically invalid headers.
object_type == "Email"
&& @has_descendant(object_type == "Text"
&& @match_object_meta($natural_language_sentiment.compound < 0))
  • This query matches an Email object from which a Text object with negative language sentiment was extracted at any level of the object hierarchy.

Configuration Options

  • max_processed_size → maximum size of the input object that will be processed (default: 262144000)
  • max_children → maximum number of child objects to create (default: 100)
  • max_child_input_size → maximum size of a single input child object (default: 41943040)
  • max_child_output_size → maximum size of a single output child object (default: 41943040)
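
For orientation, the defaults above can be modeled as in the following sketch. It assumes the size limits are expressed in bytes, which the option descriptions do not state; under that assumption the defaults work out to 250 MiB and 40 MiB. The `Limits` class and the `child_allowed` check are hypothetical, and how the backend actually applies these limits (and when it emits LIMITS_REACHED or TOOBIG) is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Default values copied from the option list above."""
    max_processed_size: int = 262144000
    max_children: int = 100
    max_child_input_size: int = 41943040
    max_child_output_size: int = 41943040


def mib(size):
    """Convert a size to MiB, assuming the size is given in bytes."""
    return size / (1024 * 1024)


def child_allowed(limits, children_so_far, input_size, output_size):
    """Hypothetical check: would one more child fit within the limits?"""
    return (
        children_so_far < limits.max_children
        and input_size <= limits.max_child_input_size
        and output_size <= limits.max_child_output_size
    )


if __name__ == "__main__":
    defaults = Limits()
    print(mib(defaults.max_processed_size), mib(defaults.max_child_input_size))
    print(child_allowed(defaults, 0, 8879, 8879))
```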