URL

Supported formats

URL (primarily http(s)://)

Description

The backend relies on a headless Chromium browser to access the provided URL and its related objects, and collects various information and interesting objects for further processing.

note

Available in Contextal Platform 1.0 and later.

warning

This backend is currently considered experimental. In Contextal Platform 1.0, only URLs found in text produced by OCR processing are submitted to it (this can be turned off in the Text backend). Because the nature of remote content is unknown, it is highly recommended to configure safe/proxied networking for the container running the URL backend (or to use the proxy setting, see below).

Features

The backend collects the following data:

  • contents of the given URL and its related objects (images, scripts, CSS, etc.) that, from the browser's standpoint, are necessary to render the page
  • the main document's HTML source code after processing by the browser (including JavaScript modifications, etc.)
  • a screenshot of the rendered web page
  • the web page saved via the browser's print-to-PDF function
  • the downloaded file, if the given URL triggers a download in the browser

Symbols

Object

  • LIMITS_REACHED → limits were reached while processing the URL

Children

  • FETCH_INCOMPLETE → this child object couldn't be fully downloaded
  • FETCH_ERROR → there was an error while trying to download the object
  • TIMEOUT → the request timed out
  • TOOBIG → this child object was not downloaded as it exceeds the limits
  • TOOSMALL → this child object was skipped as it's smaller than 16 bytes (and not an Image type)
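
As a rough illustration of how these symbols might be consumed, the sketch below scans the direct children of a URL object for the fetch-related symbols listed above. It assumes the metadata has been loaded into a Python dict shaped like the Example Metadata, and that child symbols appear in each child's `ok.symbols` list; that placement is an assumption, and the helper name `flagged_children` and the sample record are hypothetical.

```python
from typing import Any, Iterator

# Child symbols listed in the Symbols section of this page.
FETCH_SYMBOLS = {"FETCH_INCOMPLETE", "FETCH_ERROR", "TIMEOUT", "TOOBIG", "TOOSMALL"}


def flagged_children(url_object: dict[str, Any]) -> Iterator[tuple[str, list[str]]]:
    """Yield (object_id, symbols) for direct children carrying fetch-related symbols."""
    result = url_object.get("ok") or {}
    for child in result.get("children", []):
        # Assumption: child symbols are stored under the child's "ok" -> "symbols".
        symbols = (child.get("ok") or {}).get("symbols", [])
        hits = sorted(FETCH_SYMBOLS.intersection(symbols))
        if hits:
            yield child.get("object_id", "<unknown>"), hits


# Hypothetical, heavily trimmed record used only for illustration.
sample = {
    "object_type": "URL",
    "ok": {
        "symbols": ["LIMITS_REACHED"],
        "children": [
            {"object_id": "child-a", "ok": {"symbols": ["TOOBIG"]}},
            {"object_id": "child-b", "ok": {"symbols": ["ALL_ASCII"]}},
            {"object_id": "child-c", "ok": {"symbols": ["TIMEOUT", "FETCH_ERROR"]}},
        ],
    },
}

for object_id, symbols in flagged_children(sample):
    print(object_id, symbols)
```

Children without any of the listed symbols (such as `child-b` in the sample) are skipped, so the output only covers objects whose download was incomplete, failed, timed out, or was skipped for size reasons.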

Example Metadata

{
"org": "ctx",
"object_id": "0837397276194f38838b5c7680539d79fb20fbd374df5db091d476b4b4202dd2",
"object_type": "URL",
"object_subtype": null,
"recursion_level": 3,
"size": 22,
"hashes": {
"md5": "71949b4258268ad038b142d5fa749fdb",
"sha512": "87af4c5b0e5b8d9b9e7207e09920c9040bb749bb25256d7ccc92c6325b1450f1706d770617369bf2fbf4f92e9a9d36d67d97574904b7cdfc968103b6f008dd09",
"sha256": "0837397276194f38838b5c7680539d79fb20fbd374df5db091d476b4b4202dd2",
"sha1": "4e90d1597083e0f403b160ee5c341a12ca1899e5"
},
"ctime": 1718006523.390496,
"relation_metadata": {
"url": "https://contextal.com/"
},
"ok": {
"symbols": [
"LIMITS_REACHED"
],
"object_metadata": {
"_backend_version": "1.0.0"
},
"children": [
{
"org": "ctx",
"object_id": "e8d4c0691158e2df65f02d2f8e2d9437fb260d0dfd7de609284ff5dc4058d418",
"object_type": "PDF",
"object_subtype": null,
"recursion_level": 4,
"size": 921644,
"hashes": {
"sha512": "9637f21b0964449177d66ef585b056d53ddd9ae7e54df08e675eadf4737395945096fa031e83c26bafec539edc1e798ff55fa725e41e4db469f50b6087dfab53",
"md5": "83630b69f8b2d376738978c753795241",
"sha256": "e8d4c0691158e2df65f02d2f8e2d9437fb260d0dfd7de609284ff5dc4058d418",
"sha1": "6aa2b2a8c09d39981cecbcffa658a0bc689589ba"
},
"ctime": 1718006523.390496,
"relation_metadata": {
"PrintToPdf": {
"url": "https://contextal.com/"
}
},
"ok": {
"symbols": [],
"object_metadata": {
"_backend_version": "1.0.0",
"builtin_metadata": {
"creator": "Chromium",
"producer": "Skia/PDF m125",
"title": "Contextal – Security in the Right Context"
},
[...]
{
"org": "ctx",
"object_id": "c499179d7c7e273ff061f5d53996ed54a2058bc8d04d52de19709d178777b12f",
"object_type": "HTML",
"object_subtype": null,
"recursion_level": 4,
"size": 252529,
"hashes": {
"md5": "6b7f269944288ed494b08bda2d490c40",
"sha1": "b512c137b1646b19f15b14dcab82c526afc464f6",
"sha512": "8be7b504c49a5d805a227121884f631e5d62d70f7badfcf7d8884ebfc0db78bba2ff5fb90e1d72881d983b69e5285c77caf480df983e2e9f8b3c925706193a5e",
"sha256": "c499179d7c7e273ff061f5d53996ed54a2058bc8d04d52de19709d178777b12f"
},
"ctime": 1718006523.390496,
"relation_metadata": {
"PageHtmlContent": {
"title": "Contextal – Security in the Right Context",
"url": "https://contextal.com/"
}
},
"ok": {
"symbols": [],
"object_metadata": {
"_backend_version": "1.0.0",
"encoding": "utf-8",
"forms": [
{
"action": "/#wpcf7-f198-o1",
"aria-label": "Contact form",
"class": "wpcf7-form mailchimp-ext-0.5.72 init",
"data-status": "init",
"method": "post",
"novalidate": "novalidate"
}
],
[...]
{
"org": "ctx",
"object_id": "039bfe8723a47a07b1770cf171fef9b40074b85636cb4f52b525fabf2ee868db",
"object_type": "Text",
"object_subtype": null,
"recursion_level": 4,
"size": 10644,
"hashes": {
"sha512": "abdf4ad25aeabd65c3962614387cf478397e28a5395a1609c5e2e0714a33b4e3374edd7da86b802ab429914cb7b32c20e72f2c1a2dc6ae85d397e95c137b096c",
"sha256": "039bfe8723a47a07b1770cf171fef9b40074b85636cb4f52b525fabf2ee868db",
"md5": "c0c3501afc3829e6dd70f03c201c8f3f",
"sha1": "8e5b0006f279b1c712b319e92b9ac9d92c4b5223"
},
"ctime": 1718006523.390496,
"relation_metadata": {
"HttpResponse": {
"mime_type": "application/javascript",
"resource_type": "Script",
"status_code": 200,
"status_text": "",
"url": "https://contextal.com/wp-content/14eccb22cbabbcc8a23850eb2e2889d4/dist/2079540005.js?ver=5e6956684b86fca9"
}
},
"ok": {
"symbols": [
"ALL_ASCII"
],
"object_metadata": {
"_backend_version": "1.0.0",
"encoding": "utf-8",
"number_of_ascii_range_chars": 10644,
"number_of_characters": 10644,
"number_of_digits": 325,
"number_of_newlines": 2,
"number_of_whitespaces": 378,
"number_of_words": 378,
"programming_language": "JavaScript",
[...]
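
The example above shows three kinds of `relation_metadata` on children: `PrintToPdf`, `PageHtmlContent` and `HttpResponse`. A minimal sketch of grouping children by that key follows, assuming the record has been parsed into a Python dict with the same nesting as the example. The helper name `children_by_relation` is hypothetical, and the sample is a trimmed copy of the values shown above.

```python
from collections import defaultdict
from typing import Any, Optional


def children_by_relation(
    url_object: dict[str, Any],
) -> dict[str, list[tuple[Optional[str], Optional[str]]]]:
    """Group direct children as {relation kind: [(object_type, url), ...]}."""
    grouped = defaultdict(list)
    for child in (url_object.get("ok") or {}).get("children", []):
        relation = child.get("relation_metadata") or {}
        for kind, details in relation.items():
            # In the example each relation kind maps to a dict holding a "url" field.
            url = details.get("url") if isinstance(details, dict) else None
            grouped[kind].append((child.get("object_type"), url))
    return dict(grouped)


# Trimmed copy of the example metadata; only the fields used here are kept.
sample = {
    "object_type": "URL",
    "ok": {
        "children": [
            {
                "object_type": "PDF",
                "relation_metadata": {"PrintToPdf": {"url": "https://contextal.com/"}},
            },
            {
                "object_type": "HTML",
                "relation_metadata": {"PageHtmlContent": {"url": "https://contextal.com/"}},
            },
            {
                "object_type": "Text",
                "relation_metadata": {
                    "HttpResponse": {"status_code": 200, "resource_type": "Script"}
                },
            },
        ]
    },
}

for kind, items in children_by_relation(sample).items():
    print(kind, items)
```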

Example Queries

object_type == "URL"
&& @has_descendant(object_type == "PE" || object_type == "LNK")
  • This query matches a URL object, which at some level has a descendant of PE or LNK type.
object_type == "URL"
&& @has_descendant(
    object_type == "Image"
    && (@match_object_meta($nsfw_verdict == "Hentai")
        || @match_object_meta($nsfw_verdict == "Sexy")
        || @match_object_meta($nsfw_verdict == "Porn"))
)
  • This query matches a URL, which at some level contains an Image object with NSFW content.
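
To make the "at some level" wording concrete, here is a small Python sketch that emulates a descendant check over nested metadata shaped like the Example Metadata (children nested under `ok.children`). It is only a local illustration of the idea, not the platform's query engine; the function `has_descendant` and the sample record are hypothetical.

```python
from typing import Any, Callable

Predicate = Callable[[dict[str, Any]], bool]


def has_descendant(obj: dict[str, Any], predicate: Predicate) -> bool:
    """Return True if any child, grandchild, etc. of obj satisfies predicate."""
    for child in (obj.get("ok") or {}).get("children", []):
        # Check the child itself first, then recurse into its own children.
        if predicate(child) or has_descendant(child, predicate):
            return True
    return False


# Hypothetical record: a URL whose download is an archive containing a PE file.
sample = {
    "object_type": "URL",
    "ok": {
        "children": [
            {"object_type": "HTML", "ok": {"children": []}},
            {
                "object_type": "Zip",
                "ok": {"children": [{"object_type": "PE", "ok": {"children": []}}]},
            },
        ]
    },
}

is_match = sample["object_type"] == "URL" and has_descendant(
    sample, lambda o: o.get("object_type") in ("PE", "LNK")
)
print(is_match)
```

The PE object in the sample is two levels below the URL object, which is the kind of case the first query above is meant to catch.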

Configuration Options

  • window_size → Browser (virtual) window size. JavaScript running in the browser window can query these dimensions. To avoid common bot-detection mechanisms, it's recommended to use a common size. (default: [1920, 1200])
  • user_agent → User agent string for browser to use while performing HTTP requests. (default: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36")
  • accept_language → Optional "Accept-Language" HTTP header to use while performing HTTP requests. (default: "en-US,en")
  • chrome_request_timeout_msec → Chromium's request timeout in milliseconds: the duration after which a request with no response times out.
  • idle_network_settle_time_msec → An interval of time in milliseconds which must pass with no requests in progress before the page is considered fully loaded. The higher this interval, the higher the probability that all elements on the page are fully loaded. On the other hand, every backend request incurs this additional delay, since the backend waits to confirm that there are no more network requests to fulfill. (default: 5000)
  • proxy → Optional proxy server, in a format such as "192.168.0.254:3128" or "socks5://192.168.0.254:9050" (default: disabled)
  • max_instance_lifetime_seconds → Maximum time in seconds for a browser instance to run before being recycled by the backend. (default: 600)
  • max_backend_requests_per_instance → Maximum number of backend requests to process before recycling the browser. (default: 10)
  • take_screenshot → Specifies whether to take a screenshot and produce a corresponding artifact after navigating to the requested URL. (default: true)
  • perform_print_to_pdf → Specifies whether to perform print-to-PDF and produce a corresponding artifact after navigating to the requested URL. (default: true)
  • save_original_response → Specifies whether to save the original HTTP response document and produce a corresponding artifact. The HTML document in the browser window often differs from the original HTML document sent by the web server in reply to the original HTTP request, usually because JavaScript executed by the browser adds to, modifies, or updates the original HTML. From an analysis perspective, the HTML page actually rendered in the browser window is much more valuable than the original HTML page, as it represents what the end user is presented with. (default: true)
  • max_response_content_length → Maximum HTTP response body size in bytes (i.e. the HTTP Content-Length value) that the browser is allowed to fetch. The HTTP response body may be compressed, so the limit applies to the body before decompression, and the decompressed body could be larger than the specified limit. While this parameter makes it possible to filter out some responses before the body download begins, not all responses are subject to this limit, because the Content-Length HTTP header is optional. (default: 512000)
  • max_response_data_length → Maximum allowed HTTP response data size in bytes (after HTTP transfer-encoding decompression). This limit is applied while the response body is being received. When the received body grows beyond the specified limit, the response is interrupted. (default: 1024000)
  • excluded_resource_types → An optional list of resource types, which shouldn't be further processed. (default: [ "Font", "Stylesheet" ])
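
The two size limits act at different stages: max_response_content_length is checked against the declared Content-Length before the download starts, while max_response_data_length is enforced as the body arrives. The sketch below models that interplay in plain Python under stated assumptions. The function names are hypothetical, the exact boundary behaviour (whether a size equal to the limit is allowed) is assumed, and this is not the backend's actual implementation.

```python
from typing import Iterable, Optional

# Defaults taken from the option descriptions above.
MAX_RESPONSE_CONTENT_LENGTH = 512_000
MAX_RESPONSE_DATA_LENGTH = 1_024_000


def allowed_by_content_length(
    content_length: Optional[int], limit: int = MAX_RESPONSE_CONTENT_LENGTH
) -> bool:
    """Pre-download check. The header is optional, so a missing value passes."""
    return content_length is None or content_length <= limit


def receive_with_limit(
    chunks: Iterable[bytes], limit: int = MAX_RESPONSE_DATA_LENGTH
) -> tuple[bytes, bool]:
    """Accumulate body chunks and stop once the total grows over the limit.

    Returns (data received so far, whether the transfer was interrupted).
    """
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        if len(received) > limit:
            return bytes(received), True
    return bytes(received), False


# A response without Content-Length passes the first check,
# but can still be cut off while its body is being received.
print(allowed_by_content_length(None))
body, interrupted = receive_with_limit([b"a" * 600_000, b"b" * 600_000, b"c"])
print(len(body), interrupted)
```

In this model a response that omits Content-Length is only ever caught by the second limit, which is why the two options are complementary rather than redundant.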