Supported formats
RFC 5322
Description
This backend offers a high-performance, adaptable email parser designed for efficient data and metadata extraction. It prioritizes parsing methods that align closely with how modern Mail User Agents handle email content, rather than strictly following official specifications. This approach allows for more realistic and comprehensive analysis of email data, capturing details that are often missed by traditional parsers focused solely on technical standards.
Available in Contextal Platform 1.0 and later.
Features
- Header decoding: minimal parsing, generic validation
- RFC 2047 and RFC 2184 header character set decoding
- Body decoding (identity, quoted-printable, base64)
- MIME multipart support (each concrete part becomes a child object)
- Charset-aware text conversion to UTF-8 (with replacement) of all inline part bodies
- Extensive extraction of metadata, anomalies, and flaws
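To make the decoding steps listed above more concrete, here is a minimal sketch using Python's standard `email` package. It is not the backend's implementation, and it does not reproduce the backend's MUA-like leniency; the sample message and the helper names `decode_header_value` and `decode_text_body` are assumptions made for illustration only.

```python
from email import message_from_bytes
from email.header import decode_header, make_header

# Hypothetical sample message: RFC 2047 encoded headers and a
# quoted-printable body declared as iso-8859-1.
RAW = (
    b"From: =?iso-8859-1?Q?J=F6rg?= <jorg@example.com>\r\n"
    b"Subject: =?utf-8?B?SGVsbG8gd29ybGQ=?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=FC=DFe\r\n"
)


def decode_header_value(raw_value):
    """Decode RFC 2047 encoded words in a header value to a Unicode string."""
    return str(make_header(decode_header(raw_value)))


def decode_text_body(part):
    """Undo the transfer encoding, then convert the bytes to text.

    Undecodable bytes are replaced rather than raising, mirroring the
    "with replacement" behaviour named in the feature list.
    """
    payload = part.get_payload(decode=True)  # handles base64 and quoted-printable
    charset = part.get_content_charset() or "us-ascii"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to a charset that cannot fail.
        return payload.decode("latin-1", errors="replace")


msg = message_from_bytes(RAW)
subject = decode_header_value(msg["Subject"])
sender = decode_header_value(msg["From"])
body = decode_text_body(msg)
print(subject, sender, body.strip(), sep="\n")
```

For a multipart message, the same `decode_text_body` helper could be applied to each part returned by `msg.walk()`, which loosely corresponds to the one-child-per-part behaviour described above.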
Symbols
Object
- DUP_BCC → duplicate Bcc header
- DUP_CC → duplicate Cc header
- DUP_ENVELOPE_TO → duplicate Envelope-To header
- DUP_FROM → duplicate From header
- DUP_IN_REPLY_TO → duplicate In-Reply-To header
- DUP_MESSAGE_ID → duplicate Message-ID header
- DUP_REPLY_TO → duplicate Reply-To header
- DUP_RETURN_PATH → duplicate Return-Path header
- DUP_SUBJECT → duplicate Subject header
- DUP_TO → duplicate To header
- FROM_LIST → the mail is apparently from a mailing list
- INVALID_DATE → the Date header is syntactically invalid
- INVALID_HEADERS → one or more of the headers are syntactically invalid
- INVALID_MIME_VER → the Mime-Version header is syntactically invalid or reports an invalid version
- LIMITS_REACHED → limits triggered while processing the message
- MISSING_DATE → the mandatory Date header is not present
- MISSING_FROM → the mandatory From header is not present
- MISSING_MESSAGE_ID → the mandatory Message-ID header is not present
- MISSING_MIME_VER → the Mime-Version header is not present
- MISSING_SUBJECT → the mandatory Subject header is not present
- MISSING_TO → the To header is not present
- RESENT → the message appears to have been resent
Children
- CHARSET_ATTM → one or more attachments (i.e. non-inline parts) carry a charset (some malware does this)
- INVALID_BODY_ENC → this child is a message body and contains flaws in the way it is encoded
- INVALID_HEADERS → one or more of the part headers are syntactically invalid
- TOOBIG → this part was not extracted as it exceeds the limits
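As an illustration of what the DUP_* and MISSING_* object symbols express, the following sketch derives them from a list of header names. The backend's actual detection rules are not documented here; the function `header_symbols` and the two lookup tables are hypothetical and only mirror the symbol names listed above.

```python
from collections import Counter

# Header name (lowercase) -> symbol raised when the header occurs more than once.
DUP_SYMBOLS = {
    "bcc": "DUP_BCC",
    "cc": "DUP_CC",
    "envelope-to": "DUP_ENVELOPE_TO",
    "from": "DUP_FROM",
    "in-reply-to": "DUP_IN_REPLY_TO",
    "message-id": "DUP_MESSAGE_ID",
    "reply-to": "DUP_REPLY_TO",
    "return-path": "DUP_RETURN_PATH",
    "subject": "DUP_SUBJECT",
    "to": "DUP_TO",
}

# Header name (lowercase) -> symbol raised when the header is absent.
MISSING_SYMBOLS = {
    "date": "MISSING_DATE",
    "from": "MISSING_FROM",
    "message-id": "MISSING_MESSAGE_ID",
    "mime-version": "MISSING_MIME_VER",
    "subject": "MISSING_SUBJECT",
    "to": "MISSING_TO",
}


def header_symbols(header_names):
    """Return the sorted DUP_* and MISSING_* symbols for a list of header names."""
    counts = Counter(name.lower() for name in header_names)
    symbols = {sym for name, sym in DUP_SYMBOLS.items() if counts[name] > 1}
    symbols |= {sym for name, sym in MISSING_SYMBOLS.items() if counts[name] == 0}
    return sorted(symbols)


example = header_symbols(["From", "From", "To", "Subject", "Message-ID"])
print(example)
```

In this example the repeated From header and the absent Date and Mime-Version headers are the only conditions that trigger.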
Example Metadata
{
  "org": "ctx",
  "object_id": "a7e70dafb3bbc49ff7e284d084ea80e7a687903712c30d54388cfb986062550d",
  "object_type": "Email",
  "object_subtype": null,
  "recursion_level": 1,
  "size": 11301,
  "hashes": {
    "sha256": "a7e70dafb3bbc49ff7e284d084ea80e7a687903712c30d54388cfb986062550d",
    "sha1": "1dff07e2f13f7b2171042d924b3a6b647f04671c",
    "sha512": "b8de8e5000bbf5d335cbdb37751c5f1ecfd041383d2501c3136e54f5c90e0cfa19ae3f7d595da0c2c7326976f150e3ab72bc03f834b1af35960f8f8aa5b215f7",
    "md5": "bcb6d5b532df71d89154946206449487"
  },
  "ctime": 1725899412.988861,
  "ok": {
    "symbols": [],
    "object_metadata": {
      "_backend_version": "1.0.0",
      "charset": "iso-8859-1",
      "date_ts": 1598959647,
      "has_html_body": true,
      "has_text_body": false,
      "hdrs_health": {
        "bad_name": false,
        "bad_value": false,
        "bad_value_encoding": false,
        "bad_value_params": false,
        "bad_value_quoting": false
      },
      "headers": [
        {
          "dup": false,
          "name": "from",
          "value": "e-mail server bl****ware.com < tc***n@bl****ware.com >"
        },
        {
          "dup": false,
          "name": "message-id",
          "value": "<20200901042726.3b5c769a539eaed6@bl****ware.com>"
        },
        {
          "dup": false,
          "name": "reply-to",
          "value": "info@ph****api.live"
        },
        {
          "dup": false,
          "name": "return-path",
          "value": "<tc***n@bl****ware.com>"
        },
        {
          "dup": false,
          "name": "subject",
          "value": "email suspension warning for tc***n@bl****ware.com"
        },
        {
          "dup": false,
          "name": "to",
          "value": "tc***n@bl****ware.com"
        }
      ],
      "mime_type": "text/html",
      "multipart": false,
      "n_attachments": 0
    },
    "children": [
      {
        "org": "ctx",
        "object_id": "b93e1ad4e327fc4ced0d76d8db5c3c170cd50ca74585131b0c99e88b08e4326b",
        "object_type": "HTML",
        "object_subtype": null,
        "recursion_level": 2,
        "size": 8879,
[...]
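The metadata above can be consumed like any other JSON document once it has been loaded into a dictionary. The sketch below pulls header values out of a result with the same shape; how the result is obtained from the platform is not covered here, and the helper `header_values` and the trimmed `sample` dictionary are assumptions for illustration. Header names are compared in lowercase because that is how they appear in the example.

```python
def header_values(result, name):
    """Return all values of the named header from a result shaped like the example."""
    meta = result.get("ok", {}).get("object_metadata", {})
    wanted = name.lower()
    return [h["value"] for h in meta.get("headers", []) if h.get("name") == wanted]


# Trimmed-down stand-in for the example metadata (hypothetical subset).
sample = {
    "object_type": "Email",
    "ok": {
        "symbols": [],
        "object_metadata": {
            "has_html_body": True,
            "headers": [
                {"dup": False, "name": "reply-to", "value": "info@ph****api.live"},
                {"dup": False, "name": "to", "value": "tc***n@bl****ware.com"},
            ],
        },
    },
}

reply_to = header_values(sample, "Reply-To")
print(reply_to)
```

Returning a list rather than a single value keeps the helper usable when a header is duplicated, which the `dup` flag in the metadata suggests can happen.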
Example Queries
object_type == "Email"
&& @has_symbol("INVALID_HEADERS")
- This query matches an Email object with syntactically invalid headers.
object_type == "Email"
&& @has_descendant(object_type == "Text"
&& @match_object_meta($natural_language_sentiment.compound < 0))
- This query matches an Email object from which a Text object with a negative language sentiment was extracted at some point.
Configuration Options
- max_processed_size → maximum size of the input object that will be processed (default: 262144000)
- max_children → maximum number of child objects to create (default: 100)
- max_child_input_size → maximum size of a single input child object (default: 41943040)
- max_child_output_size → maximum size of a single output child object (default: 41943040)
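The default size limits are easier to read once converted. The short sketch below assumes the size values are expressed in bytes, which the source does not state explicitly; the `DEFAULTS` dictionary and the `to_mib` helper are illustrative names, not part of the platform's configuration format.

```python
# Defaults copied from the option list above; the byte unit is an assumption.
DEFAULTS = {
    "max_processed_size": 262144000,
    "max_children": 100,
    "max_child_input_size": 41943040,
    "max_child_output_size": 41943040,
}

MIB = 1024 * 1024


def to_mib(size_in_bytes):
    """Convert a byte count to mebibytes."""
    return size_in_bytes / MIB


for option in ("max_processed_size", "max_child_input_size", "max_child_output_size"):
    print(f"{option}: {to_mib(DEFAULTS[option]):.0f} MiB")
```

Under that assumption, the default input limit works out to 250 MiB and each child limit to 40 MiB.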