HTML
Description
HTML (HyperText Markup Language) is the standard markup language used to create and design the structure of web pages. It defines the content, layout, and organization of elements like text, images, links, forms, and multimedia that users see in their web browsers. This backend analyzes HTML objects, collects various statistics and extracts interesting data for further inspection.
Available in Contextal Platform 1.0 and later.
Features
The backend collects the following data:
- HTML language & data encoding
- details of image tags
- links
- unique hosts
- input types
- details of forms & script tags
- extracted text
- extracted data from "data" URLs (RFC 2397)
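As a rough illustration of the kind of data listed above, the following is a minimal sketch built on Python's standard `html.parser`. It is not the backend's implementation: the class `HtmlStats`, the helper `host_of` and the sample document are hypothetical, the output keys only loosely mirror the field names shown under Example Metadata, and the counting rules (for instance, which tags are counted) may differ from what the backend does.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host of a URL, or None if it has none."""
    try:
        return urlsplit(url).hostname
    except ValueError:  # e.g. a malformed IPv6 literal
        return None


class HtmlStats(HTMLParser):
    """Hypothetical collector for language, links, hosts, input types and text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lang = None
        self.tag_counters = Counter()
        self.href = []
        self.img_src = []
        self.input_types = set()
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.tag_counters[tag] += 1
        if tag == "html" and attrs.get("lang"):
            self.lang = attrs["lang"]
        elif tag == "a" and attrs.get("href"):
            self.href.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.img_src.append(attrs["src"])
        elif tag == "input":
            # In HTML, an <input> without a type attribute behaves as "text".
            self.input_types.add((attrs.get("type") or "text").lower())

    def handle_data(self, data):
        # Note: this also receives the contents of <script> and <style>.
        if data.strip():
            self.text_parts.append(data.strip())

    def summary(self):
        hosts = {host_of(u) for u in self.href + self.img_src}
        hosts.discard(None)
        return {
            "lang": self.lang,
            "href": self.href,
            "img_src": self.img_src,
            "input_types": sorted(self.input_types),
            "tag_count": sum(self.tag_counters.values()),
            "tag_counters": dict(self.tag_counters),
            "unique_hosts": sorted(hosts),
            "text": " ".join(self.text_parts),
        }


SAMPLE = """<html lang="en-AU"><head><title>Demo</title></head>
<body><p>Hello <a href="https://example.com/a">link</a></p>
<p><img src="https://cdn.example.net/x.png"><input type="password"><input></p>
</body></html>"""

parser = HtmlStats()
parser.feed(SAMPLE)
parser.close()
stats = parser.summary()
print(stats)
```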
Symbols
Object
CHAR_DECODING_ERRORS
→ errors were encountered while converting the input data to UTF-8
LIMITS_REACHED
→ limits were triggered while processing the document
Children
RFC2397
→ the child was created from a "data" URL
TOOBIG
→ this child object (containing text extracted from the HTML) was truncated because it exceeded the limits
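To make the RFC2397 case more concrete, here is a minimal sketch of decoding a "data" URL in Python. The function `decode_data_url` is hypothetical and is not the backend's code. The returned keys reuse the `relation_metadata` names from the example on this page; treating `encoded_size` as the length of the whole URL is an assumption. The figures in that example are consistent with this reading (250807 bytes encode to 334412 Base64 characters, and `data:application/msword;base64,` adds 31 more, giving 334443), but they do not confirm it.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_url(url):
    """Decode an RFC 2397 "data" URL into (payload_bytes, metadata_dict)."""
    scheme, colon, rest = url.partition(":")
    if not colon or scheme.lower() != "data":
        raise ValueError("not a data URL")
    header, comma, payload = rest.partition(",")
    if not comma:
        raise ValueError("data URL has no ',' separator")

    params = header.split(";")
    # RFC 2397: the media type defaults to text/plain when it is omitted.
    mime_type = params[0].strip().lower() or "text/plain"
    is_base64 = any(p.strip().lower() == "base64" for p in params[1:])

    raw = unquote_to_bytes(payload)  # undo %XX escapes
    if is_base64:
        # Drop stray whitespace; b64decode raises binascii.Error on bad padding.
        data = base64.b64decode(b"".join(raw.split()))
    else:
        data = raw

    metadata = {
        "mime_type": mime_type,
        "encoded_size": len(url),  # assumption: size of the whole URL
        "decoded_size": len(data),
    }
    return data, metadata


data, meta = decode_data_url("data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==")
print(data, meta)

note, note_meta = decode_data_url("data:,A%20brief%20note")
print(note, note_meta)
```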
Example Metadata
{
  "org": "ctx",
  "object_id": "7f6daa7faec84b91a8f7a8e406715852648a4c1af08b13d421079d6d42e0a7d8",
  "object_type": "HTML",
  "object_subtype": null,
  "recursion_level": 1,
  "size": 406250,
  "hashes": {
    "md5": "84563d29ecb50a71f54a372b9bb19a30",
    "sha256": "7f6daa7faec84b91a8f7a8e406715852648a4c1af08b13d421079d6d42e0a7d8",
    "sha512": "94b205f9ea1f40a2b68ed80a83b39aa703028a9e7778def2f5b878c9db393ce521ddfd2e307ae0c9c7bb22403ec09fed00ef7759f106b733f73ec3b3aac23217",
    "sha1": "e5858ac13a0a7bdb64433cf59d0529566c933c55"
  },
  "ctime": 1718127191.362101,
  "ok": {
    "symbols": [],
    "object_metadata": {
      "_backend_version": "1.0.0",
      "encoding": "utf-8",
      "forms": [],
      "href": [],
      "img_data_src": [],
      "img_src": [],
      "input_types": [],
      "lang": "en-AU",
      "scripts": [],
      "tag_count": 4,
      "tag_counters": {
        "a": 1,
        "p": 2,
        "title": 1
      },
      "unique_hosts": []
    },
    "children": [
      {
        "org": "ctx",
        "object_id": "587bfa9fe6162e0c74dfaa1e48b2ff1b596f803b95648d362daa412cf9dcbb3a",
        "object_type": "Office",
        "object_subtype": null,
        "recursion_level": 2,
        "size": 250807,
        "hashes": {
          "sha1": "ee867ac81fcb3e51995dcf90aaad659cd340a750",
          "md5": "065fee6d19cb04e56ab15b1682c463b6",
          "sha256": "587bfa9fe6162e0c74dfaa1e48b2ff1b596f803b95648d362daa412cf9dcbb3a",
          "sha512": "a9a4715ec8d374cfce9137bd5fa91e662f90bade289d07c7f9f134b5bef53338502642c2f14b745946a4dba8e3afb2d914e2aa32dad410c9a27e33837beedbe1"
        },
        "ctime": 1718127191.362101,
        "relation_metadata": {
          "decoded_size": 250807,
          "encoded_size": 334443,
          "mime_type": "application/msword"
        },
        "ok": {
          "symbols": [
            "DOCX",
            "INFECTED",
            "INFECTED-CLAM-Doc.Dropper.Agent-7004486-0",
            "RFC2397",
            "VBA"
          ],
          [...]
      {
        "org": "ctx",
        "object_id": "e82c1bcfa91fdae0e0539b619800994face1a10e663885f6b793c8db2e18ca4c",
        "object_type": "Text",
        "object_subtype": null,
        "recursion_level": 2,
        "size": 1753,
        "hashes": {
          "md5": "66226276503ad9e96dd86b35eaec4749",
          "sha1": "ece3315b56447ed9caa497576fa2ca60d22765e7",
          "sha256": "e82c1bcfa91fdae0e0539b619800994face1a10e663885f6b793c8db2e18ca4c",
          "sha512": "1cbbac6c7401d8cd546de92f6bd817f910b25f4bda058b2e2d8a02631516d2d6ecfcd81df057a7314213e1d8db80cab12493446f40e181784ed8666330fb2a95"
        },
        "ctime": 1718127191.362101,
        "relation_metadata": {},
        "ok": {
          "symbols": [
            "ALL_ASCII",
            "A_LOT_OF_NUMBERS"
          ],
          "object_metadata": {
            "_backend_version": "1.0.0",
            "encoding": "utf-8",
            "number_of_ascii_range_chars": 1753,
            "number_of_characters": 1753,
            "number_of_digits": 363,
            "number_of_newlines": 3,
            "number_of_whitespaces": 70,
            "number_of_words": 70,
            "possible_passwords": [],
            "uris": []
          },
          "children": []
        }
      }
    ]
  }
}
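Once such a result has been loaded as a Python dictionary, the nested `children` lists can be walked to get a flat overview of what was extracted. The sketch below is one possible way to do that; the helpers `iter_objects` and `summarize` are hypothetical, and `RESULT` is a heavily trimmed stand-in for the metadata above (hashes, sizes and the elided parts are simply left out).

```python
def iter_objects(node, depth=0):
    """Yield (depth, object) for a result object and all of its descendants."""
    yield depth, node
    for child in (node.get("ok") or {}).get("children", []):
        yield from iter_objects(child, depth + 1)


def summarize(result):
    """Return one row per object with its type, MIME type (if any) and symbols."""
    rows = []
    for depth, obj in iter_objects(result):
        ok = obj.get("ok") or {}
        relation = obj.get("relation_metadata") or {}
        rows.append({
            "depth": depth,
            "object_type": obj.get("object_type"),
            "mime_type": relation.get("mime_type"),
            "symbols": list(ok.get("symbols", [])),
        })
    return rows


# Trimmed stand-in for the example metadata; omitted keys are not shown.
RESULT = {
    "object_type": "HTML",
    "ok": {
        "symbols": [],
        "children": [
            {
                "object_type": "Office",
                "relation_metadata": {"mime_type": "application/msword"},
                "ok": {"symbols": ["DOCX", "INFECTED", "RFC2397", "VBA"]},
            },
            {
                "object_type": "Text",
                "relation_metadata": {},
                "ok": {"symbols": ["ALL_ASCII", "A_LOT_OF_NUMBERS"], "children": []},
            },
        ],
    },
}

rows = summarize(RESULT)
from_data_urls = [r for r in rows if "RFC2397" in r["symbols"]]
for row in rows:
    print(row)
```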
Example Queries
object_type == "HTML"
  && @has_descendant(
    object_type == "PE"
    || object_type == "LNK"
    || (object_type == "Office" && @has_symbol("VBA"))
  )
- This query matches an HTML object that, at some level, has a descendant of type PE or LNK, or of type Office with VBA macros.
object_type == "HTML"
  && @has_child(object_type == "Text"
    && @match_object_meta($natural_language_sentiment.compound < 0))
- This query matches an HTML object from which text with a negative language sentiment was extracted.
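For readers who want to reason about these queries offline, the first one can be approximated in Python over a result dictionary. This is only a sketch under stated assumptions: the helpers `has_symbol`, `has_child` and `has_descendant` are hypothetical local stand-ins named after the query functions, not the platform's query engine, and it is assumed that `has_child` looks at direct children only while `has_descendant` looks at any depth. The sample objects are invented for the illustration.

```python
def children_of(obj):
    return (obj.get("ok") or {}).get("children", [])


def has_symbol(obj, symbol):
    return symbol in (obj.get("ok") or {}).get("symbols", [])


def has_child(obj, predicate):
    """True if any direct child satisfies the predicate (assumed semantics)."""
    return any(predicate(c) for c in children_of(obj))


def has_descendant(obj, predicate):
    """True if any object below obj, at any depth, satisfies the predicate."""
    return any(predicate(c) or has_descendant(c, predicate)
               for c in children_of(obj))


def risky_payload(obj):
    kind = obj.get("object_type")
    return kind in ("PE", "LNK") or (kind == "Office" and has_symbol(obj, "VBA"))


def first_query(obj):
    """Local approximation of the first example query."""
    return obj.get("object_type") == "HTML" and has_descendant(obj, risky_payload)


# Invented sample: HTML -> Zip -> PE (the PE is a grandchild, not a child).
nested = {
    "object_type": "HTML",
    "ok": {"symbols": [], "children": [
        {"object_type": "Zip", "ok": {"symbols": [], "children": [
            {"object_type": "PE", "ok": {"symbols": [], "children": []}},
        ]}},
    ]},
}
# Invented sample: HTML with only an extracted text child.
benign = {
    "object_type": "HTML",
    "ok": {"symbols": [], "children": [
        {"object_type": "Text", "ok": {"symbols": ["ALL_ASCII"], "children": []}},
    ]},
}

print(first_query(nested), first_query(benign))
```

In this sketch `has_child(nested, risky_payload)` is false while `has_descendant(nested, risky_payload)` is true, which is the distinction the two example queries rely on.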
Configuration Options
max_processed_size
→ maximum size of the input object that will be processed (default: 10485760)
max_children
→ maximum number of child objects to create (default: 50)
max_child_size
→ maximum size of a single new child object (default: 3145728)
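The defaults are round binary multiples: 10485760 = 10 × 1024 × 1024 and 3145728 = 3 × 1024 × 1024, so if the sizes are counted in bytes (an assumption, as the unit is not stated here) they correspond to 10 MiB and 3 MiB. The sketch below shows one way such limits could interact with the symbols described earlier. The `Limits` class and `plan_children` function are hypothetical, and the mapping used (truncated child gets TOOBIG, exceeding the child count sets LIMITS_REACHED on the parent) is an assumption for illustration, not documented backend behavior.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    # Defaults copied from the options above; the byte unit is an assumption.
    max_processed_size: int = 10 * 1024 * 1024  # 10485760
    max_children: int = 50
    max_child_size: int = 3 * 1024 * 1024       # 3145728


def plan_children(payloads, limits):
    """Return (children, parent_symbols) after applying the child limits.

    Assumed behavior: an oversized payload is truncated and tagged TOOBIG;
    payloads beyond max_children are dropped and the parent gets LIMITS_REACHED.
    """
    parent_symbols = set()
    children = []
    for payload in payloads:
        if len(children) >= limits.max_children:
            parent_symbols.add("LIMITS_REACHED")
            break
        child_symbols = []
        if len(payload) > limits.max_child_size:
            payload = payload[:limits.max_child_size]
            child_symbols.append("TOOBIG")
        children.append((payload, child_symbols))
    return children, sorted(parent_symbols)


# Tiny limits so the effect is visible without large inputs.
small = Limits(max_children=2, max_child_size=4)
children, parent_symbols = plan_children([b"abcdef", b"xy", b"dropped"], small)
print(children, parent_symbols)
```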