Text

Supported formats

Text objects in ASCII or Unicode formats.

Description

The Contextal Platform can extract text from many different file formats, which makes the text backend one of the most frequently used. The backend extracts various statistics, indicators, and other data that can later be contextualized.

Note: Available in Contextal Platform 1.0 and later.

Features

Text Statistics

Details about the text structure are collected, such as the number of characters, digits, words, lines, and whitespace characters.
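As a rough illustration of what such statistics look like, here is a minimal Python sketch. The field names are borrowed from the example metadata on this page; the function name `text_statistics` and the counting rules are assumptions (for instance, newlines are counted separately from other whitespace), so the backend's actual numbers may differ.

```python
def text_statistics(text: str) -> dict:
    """Collect simple structural counts from an already decoded text.

    Hypothetical helper: the counting rules below are assumptions,
    not the backend's documented behavior.
    """
    return {
        "number_of_characters": len(text),
        # Characters whose code point falls in the 7-bit ASCII range.
        "number_of_ascii_range_chars": sum(1 for ch in text if ord(ch) < 128),
        "number_of_digits": sum(1 for ch in text if ch.isdigit()),
        "number_of_newlines": text.count("\n"),
        # Assumption: newlines are not counted as whitespace here.
        "number_of_whitespaces": sum(
            1 for ch in text if ch.isspace() and ch != "\n"
        ),
        # Words are taken to be runs of non-whitespace characters.
        "number_of_words": len(text.split()),
    }


stats = text_statistics("Hello world\n42 apples\n")
```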

Natural Language Detection and Analysis

The backend can detect the natural (human) language used in the text (object metadata: natural_language). For English text, sentiment analysis and profanity detection are additionally performed.

Programming Language Detection

Common scripting languages that can be easily executed or are often part of threat toolkits are detected (object metadata: programming_language). This includes Python, shell scripts, JavaScript, and others. Compiled languages are not detected, as we don't see a practical use case for them here.

Credit Card Number Detection

When a technically valid credit card number is detected, the CC_NUMBER symbol is assigned to the object.
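This page does not define "technically valid". A common interpretation is that the number has a plausible length and passes the Luhn checksum, and the following Python sketch works under that assumption. The function name `luhn_valid` and the 13 to 19 digit length range are illustrative choices, not the backend's documented logic.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Assumption: separators such as spaces or dashes are ignored and
    a card number has between 13 and 19 digits.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False

    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# 4111 1111 1111 1111 is a widely used test number that passes Luhn.
is_card = luhn_valid("4111 1111 1111 1111")
is_not_card = luhn_valid("4111 1111 1111 1112")
```

A checksum match only shows that the digits are well formed, which fits the "possible credit card number" wording of the CC_NUMBER symbol.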

Possible Password Collection

The backend collects possible passwords (object metadata: possible_passwords), which can later be used for contextual auto-decryption.

URI Extraction

Common URI formats are extracted for further analysis (object metadata: uris).

Symbols

Object

CHAR_DECODING_ERRORS → errors occurred while converting the input data to UTF-8
CODE_ALL_COMMENTS → a programming language was detected, but all of the code was commented out
CC_NUMBER → a possible credit card number was detected in the text
MANY_NUMBERS → numbers make up 10-50% of the text
MOSTLY_NUMBERS → numbers make up more than 50% of the text (but not all of it)
ALL_NUMBERS → the text contains only numbers
ALL_ASCII → the text contains only ASCII characters
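The three number-related symbols can be read as thresholds on the share of numeric content. The Python sketch below is one possible reading: it assumes the percentage is the fraction of digit characters among non-whitespace characters, and that the boundaries are inclusive at 10% and exclusive above 50%. Neither detail is stated on this page, and `number_symbol` is a hypothetical name.

```python
from typing import Optional


def number_symbol(text: str) -> Optional[str]:
    """Pick at most one number-related symbol for `text`.

    Assumption: the ratio is digits / non-whitespace characters.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return None

    digits = sum(1 for ch in chars if ch.isdigit())
    if digits == len(chars):
        return "ALL_NUMBERS"

    ratio = digits / len(chars)
    if ratio > 0.5:
        return "MOSTLY_NUMBERS"
    if ratio >= 0.1:
        return "MANY_NUMBERS"
    return None


all_numbers = number_symbol("12345")
mostly_numbers = number_symbol("1234a")
many_numbers = number_symbol("12abcdef")
no_symbol = number_symbol("abcdefghijk1")
```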

Example Metadata

{
  "org": "ctx",
  "object_id": "e5316dd6d511e5bbc456f0ec8fcde61a9344d54177818cd6c01fb2c2d9d254fd",
  "object_type": "Text",
  "object_subtype": null,
  "recursion_level": 7,
  "size": 163,
  "hashes": {
    "md5": "397d848c8cd9e3f3aad5664dc9191f0d",
    "sha1": "fb62667261ba85853b5318828cb1bdefca980822",
    "sha256": "e5316dd6d511e5bbc456f0ec8fcde61a9344d54177818cd6c01fb2c2d9d254fd",
    "sha512": "0c72871eaa285b2c5c2ae3f74cd4a090d67641877676d595178a909f276ef6b916d0f47306f5593c66c3c9189054e32cf4c9bfac8b20c912b9dcf2c915c5e102"
  },
  "ctime": 1724667831.143179,
  "relation_metadata": {},
  "ok": {
    "symbols": [
      "OCR"
    ],
    "object_metadata": {
      "_backend_version": "1.0.0",
      "encoding": "utf-8",
      "natural_language": "English",
      "natural_language_profanity_count": 0,
      "natural_language_sentiment": {
        "compound": 0.561048269402528,
        "neg": 0,
        "neu": 0.8384279475982532,
        "pos": 0.1615720524017467
      },
      "number_of_ascii_range_chars": 155,
      "number_of_characters": 159,
      "number_of_digits": 0,
      "number_of_newlines": 4,
      "number_of_whitespaces": 26,
      "number_of_words": 31,
      "possible_passwords": [],
      "uris": []
    }
  }
}

Note: The OCR symbol in the above metadata means that the text was obtained by performing optical character recognition on graphical data.
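To show how the example metadata above might be consumed, here is a small Python sketch that pulls a few fields out of such a record. The key names come from the example; the function name `summarize_text_result`, the trimmed sample, and the defensive `.get` calls are choices of this sketch and say nothing about which keys can actually be absent.

```python
import json

# Trimmed copy of the example metadata, reduced to the fields used below.
SAMPLE = """
{
  "object_type": "Text",
  "ok": {
    "symbols": ["OCR"],
    "object_metadata": {
      "natural_language": "English",
      "natural_language_profanity_count": 0,
      "natural_language_sentiment": {"compound": 0.561048269402528},
      "number_of_words": 31,
      "possible_passwords": [],
      "uris": []
    }
  }
}
"""


def summarize_text_result(record: dict) -> dict:
    """Extract a few fields of interest from a Text object record."""
    result = record.get("ok") or {}
    meta = result.get("object_metadata") or {}
    sentiment = meta.get("natural_language_sentiment") or {}
    return {
        "from_ocr": "OCR" in result.get("symbols", []),
        "language": meta.get("natural_language"),
        "profanity_count": meta.get("natural_language_profanity_count"),
        "compound_sentiment": sentiment.get("compound"),
        "uri_count": len(meta.get("uris", [])),
    }


summary = summarize_text_result(json.loads(SAMPLE))
```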

Example Queries

object_type == "Text"
    && @match_object_meta($programming_language == "JavaScript")
    && @match_object_meta($number_of_newlines == 0)

This query matches JavaScript code that has been obfuscated into a single line.

object_type == "Text"
    && @match_object_meta($natural_language_profanity_count > 0)

This query matches English text that contains profanities. For non-English text, the second condition is always false.

Configuration Options

max_processed_size → maximum text input size (default: 10485760)
natural_language_max_char_whitespace_ratio → maximum ratio of number_of_characters to number_of_whitespaces at which natural language detection is still run (default: 20.0)
natural_language_min_confidence_level → minimum confidence level, from 0.0 to 1.0, required to report a detected natural language (default: 0.2)
create_url_children → whether to create URL children for further processing. As of Contextal Platform 1.0, only URLs coming from OCR'd text are taken into account. (default: true)
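The character-to-whitespace ratio option can be pictured as a simple gate in front of natural language detection. The Python sketch below assumes the comparison is inclusive and that text with no whitespace at all is treated as exceeding the limit; the function name `should_detect_language` is hypothetical.

```python
def should_detect_language(
    number_of_characters: int,
    number_of_whitespaces: int,
    max_char_whitespace_ratio: float = 20.0,
) -> bool:
    """Decide whether natural language detection is worth running.

    Assumption: text without any whitespace is skipped, since its
    ratio would be unbounded.
    """
    if number_of_whitespaces == 0:
        return False
    ratio = number_of_characters / number_of_whitespaces
    return ratio <= max_char_whitespace_ratio


# Values from the example metadata: 159 characters, 26 whitespaces.
runs_for_example = should_detect_language(159, 26)
# A long token-like blob with almost no whitespace is skipped.
runs_for_blob = should_detect_language(500, 2)
```

With the example metadata the ratio is roughly 6.1, well under the default of 20.0, so detection would run under these assumptions.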
