Text
Supported formats
Text objects in ASCII or Unicode formats.
Description
The Contextal Platform can extract text from many different file formats, which makes the text backend one of the most frequently used. The backend extracts various statistics, indicators, and other data that can later be contextualized.
Available in Contextal Platform 1.0 and later.
Features
Text Statistics
Details about the text structure are collected, such as the number of characters, digits, words, lines, and whitespace characters.
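As a rough illustration of what such statistics look like, the sketch below computes counts named after the fields in the example metadata further down this page. The counting rules (ASCII digits only, Python's definition of whitespace, words split on whitespace) are assumptions made for the example and may differ from what the backend actually does.

```python
def text_statistics(text: str) -> dict:
    """Collect simple structural counts for a piece of text.

    Field names follow the example metadata on this page; the exact
    counting rules are assumptions, not the backend's implementation.
    """
    return {
        "number_of_characters": len(text),
        # Code points below 128 are in the ASCII range.
        "number_of_ascii_range_chars": sum(1 for ch in text if ord(ch) < 128),
        # Assumption: only ASCII digits 0-9 are counted.
        "number_of_digits": sum(1 for ch in text if "0" <= ch <= "9"),
        "number_of_newlines": text.count("\n"),
        # Assumption: anything str.isspace() accepts counts as whitespace.
        "number_of_whitespaces": sum(1 for ch in text if ch.isspace()),
        # Assumption: words are whitespace-separated tokens.
        "number_of_words": len(text.split()),
    }


stats = text_statistics("Invoice 42\nis due on Monday.\n")
print(stats)
```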
Natural Language Detection and Analysis
The backend can detect the natural (human) language used in the text (object metadata: natural_language). For English text, sentiment analysis and profanity detection are additionally enabled.
Programming Language Detection
Common scripting languages that can be easily executed or are often part of threat toolkits are detected (object metadata: programming_language). This includes Python, shell scripts, JavaScript, and others. Compiled languages are not detected, as we don't see a practical use case for them.
Credit Card Number Detection
When a technically valid credit card number is detected, it is signaled with the CC_NUMBER symbol assigned to the object.
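This page does not say which checks make a number "technically valid". A common interpretation is a plausible length plus a passing Luhn checksum, and the sketch below assumes exactly that. The names luhn_valid, find_card_numbers, and CANDIDATE are hypothetical and used only for illustration.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the ASCII digits in `number` pass the Luhn checksum.

    Assumption: a card number has 13 to 19 digits; separators are ignored.
    """
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Candidate: 13-19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b", re.ASCII)


def find_card_numbers(text: str) -> list[str]:
    """Return candidate substrings that pass the Luhn check."""
    return [m.group(0) for m in CANDIDATE.finditer(text) if luhn_valid(m.group(0))]


print(find_card_numbers("Test card: 4111 1111 1111 1111, bad: 4111 1111 1111 1112"))
```

A checksum match only shows that the digits are well formed, which is consistent with the symbol list describing CC_NUMBER as a possible credit card number.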
Possible Password Collection
The backend collects possible passwords, which can be used for contextual auto-decryption.
URI Extraction
Common URI formats will be extracted for further analysis.
Symbols
Object
CHAR_DECODING_ERRORS → issues were faced while converting the input data into UTF-8
CODE_ALL_COMMENTS → a programming language was detected but the code was all commented out
CC_NUMBER → a possible credit card number was detected in the text
MANY_NUMBERS → the text contains 10-50% of numbers
MOSTLY_NUMBERS → the text contains more than 50% of numbers (but not all)
ALL_NUMBERS → the text only contains numbers
ALL_ASCII → the text only contains ASCII characters
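The three number-related symbols can be read as bands of a digit ratio. The sketch below is one possible reading, not the backend's logic: it assumes the ratio is ASCII digits over non-whitespace characters, that 10% is an inclusive lower bound, and that exactly 50% still counts as MANY_NUMBERS. The function name number_symbol is hypothetical.

```python
from typing import Optional


def number_symbol(text: str) -> Optional[str]:
    """Pick at most one of the number-related symbols for `text`.

    Assumptions: whitespace is ignored, only ASCII digits count as
    numbers, and the band boundaries are as described in the comments.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return None
    digits = sum(1 for ch in chars if "0" <= ch <= "9")
    if digits == len(chars):
        return "ALL_NUMBERS"
    ratio = digits / len(chars)
    if ratio > 0.5:
        return "MOSTLY_NUMBERS"  # more than 50%, but not all
    if ratio >= 0.1:
        return "MANY_NUMBERS"    # 10-50%
    return None


for sample in ("12345", "1234a", "a1bcd", "hello"):
    print(repr(sample), number_symbol(sample))
```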
Example Metadata
{
  "org": "ctx",
  "object_id": "e5316dd6d511e5bbc456f0ec8fcde61a9344d54177818cd6c01fb2c2d9d254fd",
  "object_type": "Text",
  "object_subtype": null,
  "recursion_level": 7,
  "size": 163,
  "hashes": {
    "md5": "397d848c8cd9e3f3aad5664dc9191f0d",
    "sha1": "fb62667261ba85853b5318828cb1bdefca980822",
    "sha256": "e5316dd6d511e5bbc456f0ec8fcde61a9344d54177818cd6c01fb2c2d9d254fd",
    "sha512": "0c72871eaa285b2c5c2ae3f74cd4a090d67641877676d595178a909f276ef6b916d0f47306f5593c66c3c9189054e32cf4c9bfac8b20c912b9dcf2c915c5e102"
  },
  "ctime": 1724667831.143179,
  "relation_metadata": {},
  "ok": {
    "symbols": [
      "OCR"
    ],
    "object_metadata": {
      "_backend_version": "1.0.0",
      "encoding": "utf-8",
      "natural_language": "English",
      "natural_language_profanity_count": 0,
      "natural_language_sentiment": {
        "compound": 0.561048269402528,
        "neg": 0,
        "neu": 0.8384279475982532,
        "pos": 0.1615720524017467
      },
      "number_of_ascii_range_chars": 155,
      "number_of_characters": 159,
      "number_of_digits": 0,
      "number_of_newlines": 4,
      "number_of_whitespaces": 26,
      "number_of_words": 31,
      "possible_passwords": [],
      "uris": []
    }
  }
}
Note: The OCR symbol in the above metadata means that the text was obtained by performing optical character recognition on image data.
Example Queries
object_type == "Text"
&& @match_object_meta($programming_language == "JavaScript")
&& @match_object_meta($number_of_newlines == 0)
- This query matches JavaScript code obfuscated into a single line.
object_type == "Text"
&& @match_object_meta($natural_language_profanity_count > 0)
- This query matches English text with profanities. For non-English texts the second condition will always be false.
Configuration Options
max_processed_size → maximum text input size (default: 10485760)
natural_language_max_char_whitespace_ratio → maximum number_of_characters / number_of_whitespaces ratio to consider running the natural language detection (default: 20.0)
natural_language_min_confidence_level → minimum natural language confidence level to report. From 0.0 to 1.0. (default: 0.2)
create_url_children → whether to create URL children for further processing. As of Contextal Platform 1.0 only URLs coming from OCR'd text will be taken into account. (default: true)
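To make the two natural language options more concrete, the sketch below shows how such thresholds could gate detection and reporting. It is an illustration under assumptions: the function names are hypothetical, both comparisons are treated as inclusive, and text without any whitespace is treated as exceeding the ratio.

```python
def should_run_language_detection(
    number_of_characters: int,
    number_of_whitespaces: int,
    max_char_whitespace_ratio: float = 20.0,
) -> bool:
    """Decide whether the text looks enough like prose to analyze.

    Assumption: with no whitespace at all the ratio is treated as
    infinite, so detection is skipped.
    """
    if number_of_whitespaces == 0:
        return False
    ratio = number_of_characters / number_of_whitespaces
    return ratio <= max_char_whitespace_ratio


def should_report_language(confidence: float, min_confidence_level: float = 0.2) -> bool:
    """Assumption: a result is reported when confidence meets the minimum."""
    return confidence >= min_confidence_level


# Values from the example metadata on this page: 159 characters and
# 26 whitespaces give a ratio of about 6.1, below the default of 20.0.
print(should_run_language_detection(159, 26))
print(should_report_language(0.15))
```

With the default ratio, long unbroken strings such as encoded blobs or minified code would tend to skip natural language detection, while ordinary prose would pass.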