HTML

Description

HTML (HyperText Markup Language) is the standard markup language for structuring web pages. It defines the content, layout, and organization of elements such as text, images, links, forms, and multimedia that users see in their web browsers. This backend analyzes HTML objects, collects statistics about them (such as tag counts and input types), and extracts embedded data (such as text and "data" URL payloads) for further inspection.

Info: Available in Contextal Platform 1.0 and later.

Features

The backend collects the following data:

  • HTML language & data encoding
  • details of image tags
  • links
  • unique hosts
  • input types
  • details of forms & script tags
  • extracted text
  • extracted data from "data" URLs (RFC 2397)
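
As an illustration of the last item, the sketch below decodes an RFC 2397 "data" URL into its media type and payload using only the Python standard library. The function name `decode_data_url` is hypothetical, and this is not the backend's actual implementation; it only shows the general shape of the format (`data:[<mediatype>][;base64],<data>`).

```python
import base64
from typing import Tuple
from urllib.parse import unquote_to_bytes


def decode_data_url(url: str) -> Tuple[str, bytes]:
    """Return (mime_type, payload) for an RFC 2397 "data" URL.

    Hypothetical helper for illustration; not part of the platform.
    """
    if url[:5].lower() != "data:":
        raise ValueError("not a data URL")
    header, sep, data = url[5:].partition(",")
    if not sep:
        raise ValueError("missing ',' separator")

    params = header.split(";")
    # The ";base64" marker, when present, is the last parameter.
    is_base64 = params[-1].strip().lower() == "base64"
    if is_base64:
        params = params[:-1]

    # RFC 2397: an omitted media type defaults to text/plain.
    mime_type = (params[0].strip().lower() if params else "") or "text/plain"

    # Percent-decoding applies to the data part in both encodings.
    raw = unquote_to_bytes(data)
    payload = base64.b64decode(raw) if is_base64 else raw
    return mime_type, payload
```

For example, `decode_data_url("data:text/plain;base64,SGVsbG8=")` yields the media type `text/plain` and the five payload bytes of `Hello`.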

Symbols

Object

  • CHAR_DECODING_ERRORS → errors occurred while converting the input data to UTF-8
  • LIMITS_REACHED → one or more limits were reached while processing the document

Children

  • RFC2397 → the child was created from a "data" URL (RFC 2397)
  • TOOBIG → this child object (containing text extracted from the HTML) was truncated because it exceeded the limits

Example Metadata

{
  "org": "ctx",
  "object_id": "7f6daa7faec84b91a8f7a8e406715852648a4c1af08b13d421079d6d42e0a7d8",
  "object_type": "HTML",
  "object_subtype": null,
  "recursion_level": 1,
  "size": 406250,
  "hashes": {
    "md5": "84563d29ecb50a71f54a372b9bb19a30",
    "sha256": "7f6daa7faec84b91a8f7a8e406715852648a4c1af08b13d421079d6d42e0a7d8",
    "sha512": "94b205f9ea1f40a2b68ed80a83b39aa703028a9e7778def2f5b878c9db393ce521ddfd2e307ae0c9c7bb22403ec09fed00ef7759f106b733f73ec3b3aac23217",
    "sha1": "e5858ac13a0a7bdb64433cf59d0529566c933c55"
  },
  "ctime": 1718127191.362101,
  "ok": {
    "symbols": [],
    "object_metadata": {
      "_backend_version": "1.0.0",
      "encoding": "utf-8",
      "forms": [],
      "href": [],
      "img_data_src": [],
      "img_src": [],
      "input_types": [],
      "lang": "en-AU",
      "scripts": [],
      "tag_count": 4,
      "tag_counters": {
        "a": 1,
        "p": 2,
        "title": 1
      },
      "unique_hosts": []
    },
    "children": [
      {
        "org": "ctx",
        "object_id": "587bfa9fe6162e0c74dfaa1e48b2ff1b596f803b95648d362daa412cf9dcbb3a",
        "object_type": "Office",
        "object_subtype": null,
        "recursion_level": 2,
        "size": 250807,
        "hashes": {
          "sha1": "ee867ac81fcb3e51995dcf90aaad659cd340a750",
          "md5": "065fee6d19cb04e56ab15b1682c463b6",
          "sha256": "587bfa9fe6162e0c74dfaa1e48b2ff1b596f803b95648d362daa412cf9dcbb3a",
          "sha512": "a9a4715ec8d374cfce9137bd5fa91e662f90bade289d07c7f9f134b5bef53338502642c2f14b745946a4dba8e3afb2d914e2aa32dad410c9a27e33837beedbe1"
        },
        "ctime": 1718127191.362101,
        "relation_metadata": {
          "decoded_size": 250807,
          "encoded_size": 334443,
          "mime_type": "application/msword"
        },
        "ok": {
          "symbols": [
            "DOCX",
            "INFECTED",
            "INFECTED-CLAM-Doc.Dropper.Agent-7004486-0",
            "RFC2397",
            "VBA"
          ],
          [...]
      {
        "org": "ctx",
        "object_id": "e82c1bcfa91fdae0e0539b619800994face1a10e663885f6b793c8db2e18ca4c",
        "object_type": "Text",
        "object_subtype": null,
        "recursion_level": 2,
        "size": 1753,
        "hashes": {
          "md5": "66226276503ad9e96dd86b35eaec4749",
          "sha1": "ece3315b56447ed9caa497576fa2ca60d22765e7",
          "sha256": "e82c1bcfa91fdae0e0539b619800994face1a10e663885f6b793c8db2e18ca4c",
          "sha512": "1cbbac6c7401d8cd546de92f6bd817f910b25f4bda058b2e2d8a02631516d2d6ecfcd81df057a7314213e1d8db80cab12493446f40e181784ed8666330fb2a95"
        },
        "ctime": 1718127191.362101,
        "relation_metadata": {},
        "ok": {
          "symbols": [
            "ALL_ASCII",
            "A_LOT_OF_NUMBERS"
          ],
          "object_metadata": {
            "_backend_version": "1.0.0",
            "encoding": "utf-8",
            "number_of_ascii_range_chars": 1753,
            "number_of_characters": 1753,
            "number_of_digits": 363,
            "number_of_newlines": 3,
            "number_of_whitespaces": 70,
            "number_of_words": 70,
            "possible_passwords": [],
            "uris": []
          },
          "children": []
        }
      }
    ]
  }
}
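
Metadata in this shape nests each child under its parent's "ok" → "children" list. The sketch below walks such a tree depth-first and lists the type and symbols of every object. It assumes the metadata has already been loaded into a Python dict (for example with `json.loads`); the helper names `iter_objects` and `summarize` are hypothetical, and the sample is trimmed to the few fields the code reads.

```python
from typing import Any, Dict, Iterator, List, Tuple

Obj = Dict[str, Any]


def iter_objects(node: Obj) -> Iterator[Obj]:
    """Yield the node and then every descendant, depth-first."""
    yield node
    # Tolerate a missing or null "ok" section (an assumption about
    # how unprocessed objects might look; not confirmed by the docs).
    for child in (node.get("ok") or {}).get("children", []):
        yield from iter_objects(child)


def summarize(root: Obj) -> List[Tuple[int, str, List[str]]]:
    """Return (recursion_level, object_type, symbols) for each object."""
    return [
        (
            obj.get("recursion_level"),
            obj.get("object_type"),
            (obj.get("ok") or {}).get("symbols", []),
        )
        for obj in iter_objects(root)
    ]


# Trimmed version of the example metadata: only the fields used above.
sample = {
    "object_type": "HTML",
    "recursion_level": 1,
    "ok": {
        "symbols": [],
        "children": [
            {
                "object_type": "Office",
                "recursion_level": 2,
                "ok": {"symbols": ["DOCX", "INFECTED", "RFC2397", "VBA"]},
            },
            {
                "object_type": "Text",
                "recursion_level": 2,
                "ok": {"symbols": ["ALL_ASCII", "A_LOT_OF_NUMBERS"], "children": []},
            },
        ],
    },
}

summary = summarize(sample)
```

Here `summary` holds three entries: the HTML object at level 1, followed by its Office and Text children at level 2.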

Example Queries

object_type == "HTML"
  && @has_descendant(
    object_type == "PE"
    || object_type == "LNK"
    || (object_type == "Office" && @has_symbol("VBA"))
  )
  • This query matches an HTML object that has, at any depth, a descendant of type PE or LNK, or an Office descendant with VBA macros.

object_type == "HTML"
  && @has_child(object_type == "Text"
    && @match_object_meta($natural_language_sentiment.compound < 0))
  • This query matches an HTML object from which text with a negative language sentiment was extracted.
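
To make the logic of the first query concrete, the following sketch mirrors it in plain Python over a metadata dict. It is an illustration of the intended semantics as described above, not the platform's query engine: the Python functions `has_descendant` and `has_symbol` are hypothetical stand-ins for `@has_descendant` and `@has_symbol`, and the sample objects are made up.

```python
from typing import Any, Callable, Dict, List

Obj = Dict[str, Any]


def children_of(obj: Obj) -> List[Obj]:
    """Direct children stored under "ok" -> "children"."""
    return (obj.get("ok") or {}).get("children", [])


def has_symbol(obj: Obj, symbol: str) -> bool:
    return symbol in (obj.get("ok") or {}).get("symbols", [])


def has_descendant(obj: Obj, pred: Callable[[Obj], bool]) -> bool:
    """True if any child, grandchild, etc. satisfies pred."""
    return any(pred(c) or has_descendant(c, pred) for c in children_of(obj))


def is_suspicious(obj: Obj) -> bool:
    kind = obj.get("object_type")
    return kind in ("PE", "LNK") or (kind == "Office" and has_symbol(obj, "VBA"))


def matches_first_query(obj: Obj) -> bool:
    return obj.get("object_type") == "HTML" and has_descendant(obj, is_suspicious)


# Made-up objects for illustration only.
html_with_macro_doc = {
    "object_type": "HTML",
    "ok": {"symbols": [], "children": [
        {"object_type": "Office", "ok": {"symbols": ["DOCX", "VBA"], "children": []}},
    ]},
}
html_with_text_only = {
    "object_type": "HTML",
    "ok": {"symbols": [], "children": [
        {"object_type": "Text", "ok": {"symbols": ["ALL_ASCII"], "children": []}},
    ]},
}

first_matches = matches_first_query(html_with_macro_doc)
second_matches = matches_first_query(html_with_text_only)
```

The first sample matches because its Office child carries the VBA symbol; the second does not, since its only child is plain text.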

Configuration Options

  • max_processed_size → maximum size of the input object that will be processed (default: 10485760)
  • max_children → maximum number of child objects to create (default: 50)
  • max_child_output_size → maximum size of a single new child object (default: 3145728)
  • create_domain_children → whether to create Domain children out of collected domain names for further processing (default: true)
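
The size defaults are easier to read in larger units. The sketch below records the defaults listed above in a Python dict and converts the two size limits, assuming (the page does not state the unit) that they are expressed in bytes. The dict is only an illustration and does not reflect the platform's actual configuration file format.

```python
# Defaults as listed in this section; the dict itself is illustrative.
DEFAULTS = {
    "max_processed_size": 10485760,
    "max_children": 50,
    "max_child_output_size": 3145728,
    "create_domain_children": True,
}


def to_mib(size: int) -> float:
    """Convert a size to MiB, assuming the value is in bytes."""
    return size / (1024 * 1024)


max_processed_mib = to_mib(DEFAULTS["max_processed_size"])        # 10.0
max_child_output_mib = to_mib(DEFAULTS["max_child_output_size"])  # 3.0
```

Under that assumption, the defaults correspond to a 10 MiB input limit and a 3 MiB limit per extracted child.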