URL
Supported formats
URL (primarily http(s)://)
Description
The backend relies on a headless Chromium browser to access the provided URL and its related objects, collecting various information and interesting objects for further processing.
note
Available in Contextal Platform 1.0 and later.
warning
This backend is currently considered experimental. In Contextal Platform 1.0, only URLs found in text produced by OCR processing are submitted for processing (this can be turned off in the Text backend). Because the nature of remote content is unknown, it's highly recommended to configure safe/proxied networking for the container running the URL backend (or use the proxy setting, see below).
Features
The backend collects the following data:
- contents of the given URL and its related objects (images, scripts, CSS, etc.) which, from the browser's standpoint, are necessary to render the URL
- main document's HTML source code after processing by the browser (including JavaScript modifications, etc.)
- screenshot of the rendered web page
- web page saved via print-to-PDF browser function
- the downloaded file, if the given URL triggers a download process in the browser
Symbols
Object
LIMITS_REACHED
→ limits triggered while processing the URL
Children
FETCH_INCOMPLETE
→ this child object couldn't be fully downloaded
FETCH_ERROR
→ there was an error while trying to download the object
TIMEOUT
→ the request timed out
TOOBIG
→ this child object was not downloaded as it exceeds the limits
TOOSMALL
→ this child object was skipped as it's smaller than 16 bytes (and not an Image type)
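As a rough illustration of how the child symbols above might be consumed, the following Python sketch walks a metadata document shaped like the example on this page and reports every descendant carrying one of the fetch-related symbols. The helper name `find_fetch_problems` is hypothetical, and the sketch assumes the symbols appear in each child's `ok.symbols` list; the exact location may differ in real results.

```python
from typing import Any, Iterator

# Child symbols listed in this section.
FETCH_SYMBOLS = {"FETCH_INCOMPLETE", "FETCH_ERROR", "TIMEOUT", "TOOBIG", "TOOSMALL"}


def find_fetch_problems(node: dict[str, Any]) -> Iterator[tuple[str, list[str]]]:
    """Yield (object_id, matching symbols) for each descendant with a fetch symbol.

    Assumption: children are nested under node["ok"]["children"] and symbols
    under node["ok"]["symbols"], as in the example metadata.
    """
    result = node.get("ok") or {}
    for child in result.get("children", []):
        child_result = child.get("ok") or {}
        hits = sorted(FETCH_SYMBOLS.intersection(child_result.get("symbols", [])))
        if hits:
            yield child.get("object_id", "?"), hits
        # Children may have children of their own, so recurse.
        yield from find_fetch_problems(child)
```

Calling `list(find_fetch_problems(metadata))` on a parsed result gives a flat list that can be logged or used to decide whether a URL should be re-submitted with higher limits.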
Example Metadata
{
"org": "ctx",
"object_id": "0837397276194f38838b5c7680539d79fb20fbd374df5db091d476b4b4202dd2",
"object_type": "URL",
"object_subtype": null,
"recursion_level": 3,
"size": 22,
"hashes": {
"md5": "71949b4258268ad038b142d5fa749fdb",
"sha512": "87af4c5b0e5b8d9b9e7207e09920c9040bb749bb25256d7ccc92c6325b1450f1706d770617369bf2fbf4f92e9a9d36d67d97574904b7cdfc968103b6f008dd09",
"sha256": "0837397276194f38838b5c7680539d79fb20fbd374df5db091d476b4b4202dd2",
"sha1": "4e90d1597083e0f403b160ee5c341a12ca1899e5"
},
"ctime": 1718006523.390496,
"relation_metadata": {
"url": "https://contextal.com/"
},
"ok": {
"symbols": [
"LIMITS_REACHED"
],
"object_metadata": {
"_backend_version": "1.0.0"
},
"children": [
{
"org": "ctx",
"object_id": "e8d4c0691158e2df65f02d2f8e2d9437fb260d0dfd7de609284ff5dc4058d418",
"object_type": "PDF",
"object_subtype": null,
"recursion_level": 4,
"size": 921644,
"hashes": {
"sha512": "9637f21b0964449177d66ef585b056d53ddd9ae7e54df08e675eadf4737395945096fa031e83c26bafec539edc1e798ff55fa725e41e4db469f50b6087dfab53",
"md5": "83630b69f8b2d376738978c753795241",
"sha256": "e8d4c0691158e2df65f02d2f8e2d9437fb260d0dfd7de609284ff5dc4058d418",
"sha1": "6aa2b2a8c09d39981cecbcffa658a0bc689589ba"
},
"ctime": 1718006523.390496,
"relation_metadata": {
"PrintToPdf": {
"url": "https://contextal.com/"
}
},
"ok": {
"symbols": [],
"object_metadata": {
"_backend_version": "1.0.0",
"builtin_metadata": {
"creator": "Chromium",
"producer": "Skia/PDF m125",
"title": "Contextal – Security in the Right Context"
},
[...]
{
"org": "ctx",
"object_id": "c499179d7c7e273ff061f5d53996ed54a2058bc8d04d52de19709d178777b12f",
"object_type": "HTML",
"object_subtype": null,
"recursion_level": 4,
"size": 252529,
"hashes": {
"md5": "6b7f269944288ed494b08bda2d490c40",
"sha1": "b512c137b1646b19f15b14dcab82c526afc464f6",
"sha512": "8be7b504c49a5d805a227121884f631e5d62d70f7badfcf7d8884ebfc0db78bba2ff5fb90e1d72881d983b69e5285c77caf480df983e2e9f8b3c925706193a5e",
"sha256": "c499179d7c7e273ff061f5d53996ed54a2058bc8d04d52de19709d178777b12f"
},
"ctime": 1718006523.390496,
"relation_metadata": {
"PageHtmlContent": {
"title": "Contextal – Security in the Right Context",
"url": "https://contextal.com/"
}
},
"ok": {
"symbols": [],
"object_metadata": {
"_backend_version": "1.0.0",
"encoding": "utf-8",
"forms": [
{
"action": "/#wpcf7-f198-o1",
"aria-label": "Contact form",
"class": "wpcf7-form mailchimp-ext-0.5.72 init",
"data-status": "init",
"method": "post",
"novalidate": "novalidate"
}
],
[...]
{
"org": "ctx",
"object_id": "039bfe8723a47a07b1770cf171fef9b40074b85636cb4f52b525fabf2ee868db",
"object_type": "Text",
"object_subtype": null,
"recursion_level": 4,
"size": 10644,
"hashes": {
"sha512": "abdf4ad25aeabd65c3962614387cf478397e28a5395a1609c5e2e0714a33b4e3374edd7da86b802ab429914cb7b32c20e72f2c1a2dc6ae85d397e95c137b096c",
"sha256": "039bfe8723a47a07b1770cf171fef9b40074b85636cb4f52b525fabf2ee868db",
"md5": "c0c3501afc3829e6dd70f03c201c8f3f",
"sha1": "8e5b0006f279b1c712b319e92b9ac9d92c4b5223"
},
"ctime": 1718006523.390496,
"relation_metadata": {
"HttpResponse": {
"mime_type": "application/javascript",
"resource_type": "Script",
"status_code": 200,
"status_text": "",
"url": "https://contextal.com/wp-content/14eccb22cbabbcc8a23850eb2e2889d4/dist/2079540005.js?ver=5e6956684b86fca9"
}
},
"ok": {
"symbols": [
"ALL_ASCII"
],
"object_metadata": {
"_backend_version": "1.0.0",
"encoding": "utf-8",
"number_of_ascii_range_chars": 10644,
"number_of_characters": 10644,
"number_of_digits": 325,
"number_of_newlines": 2,
"number_of_whitespaces": 378,
"number_of_words": 378,
"programming_language": "JavaScript",
[...]
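The children in the example above are distinguished by the single key in their `relation_metadata` (`PrintToPdf`, `PageHtmlContent`, `HttpResponse`). A minimal Python sketch of grouping a URL object's direct children by that key could look like this; the function name `children_by_relation` and the choice of summary fields are illustrative assumptions, not part of the platform.

```python
from collections import defaultdict
from typing import Any


def children_by_relation(url_object: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Group the direct children of a URL object by their relation_metadata key."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for child in (url_object.get("ok") or {}).get("children", []):
        relation = child.get("relation_metadata") or {}
        # In the example each child carries exactly one key, e.g. "PrintToPdf".
        for kind, details in relation.items():
            groups[kind].append(
                {
                    "object_type": child.get("object_type"),
                    "size": child.get("size"),
                    "url": details.get("url") if isinstance(details, dict) else None,
                }
            )
    return dict(groups)
```

Applied to the example, this would place the PDF under `PrintToPdf`, the rendered page under `PageHtmlContent`, and the fetched script under `HttpResponse`.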
Example Queries
object_type == "URL"
  && @has_descendant(object_type == "PE" || object_type == "LNK")

object_type == "URL"
  && @has_descendant(object_type == "Image" &&
    (@match_object_meta($nsfw_verdict == "Hentai")
    || @match_object_meta($nsfw_verdict == "Sexy")
    || @match_object_meta($nsfw_verdict == "Porn"))
  )
- This query matches a URL, which at some level contains an Image object with NSFW content.
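To make "at some level" concrete, here is a hedged Python sketch of an equivalent check performed client-side on an already retrieved metadata tree. It is not how the platform evaluates queries; the helper names are hypothetical, and the sketch assumes `nsfw_verdict` lives under the image's `ok.object_metadata`.

```python
from typing import Any, Callable

Node = dict[str, Any]

NSFW_VERDICTS = {"Hentai", "Sexy", "Porn"}


def has_descendant(node: Node, predicate: Callable[[Node], bool]) -> bool:
    """Return True if any child, grandchild, etc. of node satisfies predicate."""
    for child in (node.get("ok") or {}).get("children", []):
        if predicate(child) or has_descendant(child, predicate):
            return True
    return False


def is_nsfw_image(node: Node) -> bool:
    """Assumed layout: the verdict is stored in ok.object_metadata.nsfw_verdict."""
    meta = (node.get("ok") or {}).get("object_metadata") or {}
    return node.get("object_type") == "Image" and meta.get("nsfw_verdict") in NSFW_VERDICTS


def url_has_nsfw_image(node: Node) -> bool:
    return node.get("object_type") == "URL" and has_descendant(node, is_nsfw_image)
```

The recursion mirrors the descendant semantics of the query: the image does not have to be a direct child of the URL object.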
Configuration Options
window_size
→ Browser (virtual) window size. JavaScript running in a browser window can query these dimensions. To avoid standard bot-detection mechanisms, it's recommended to use a standard size. (default: [1920, 1200])
user_agent
→ User agent string for the browser to use while performing HTTP requests. (default: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36")
accept_language
→ Optional "Accept-Language" HTTP header to use while performing HTTP requests. (default: "en-US,en")
chrome_request_timeout_msec
→ Chromium's request timeout in milliseconds: the duration after which a request with no response should time out.
idle_network_settle_time_msec
→ An interval of time in milliseconds which must pass with no requests in progress before the page is considered fully loaded. The higher this interval, the higher the probability that all elements on the page are fully loaded. On the other hand, every backend request incurs this additional delay to ensure that there are no more network requests to fulfill. (default: 5000)
proxy
→ Optional proxy server, in a format such as "192.168.0.254:3128" or "socks5://192.168.0.254:9050". (default: disabled)
max_instance_lifetime_seconds
→ Maximum interval of time in seconds for a browser instance to run before being recycled by the backend. (default: 600)
max_backend_requests_per_instance
→ Maximum number of backend requests to process before recycling the browser. (default: 10)
take_screenshot
→ Specifies whether to take a screenshot and produce a corresponding artifact after navigating to the URL from a request. (default: true)
perform_print_to_pdf
→ Specifies whether to perform print-to-PDF and produce a corresponding artifact after navigating to the URL from a request. (default: true)
save_original_response
→ Specifies whether to save the original HTTP response document and produce a corresponding artifact. The HTML document in the browser window often differs from the original HTML document sent by the web server in reply to the original HTTP request. This is usually caused by JavaScript code executed by the browser, which adds to, modifies, or updates the original HTML document. From the analysis perspective, the actual HTML page rendered in the browser's window is much more valuable than the original HTML page, as it represents what the end user is presented with. (default: true)
max_response_content_length
→ Maximum HTTP response body size in bytes (i.e. HTTP content-length) to allow the browser to fetch. The HTTP response body could be compressed, so the limit applies to the body before decompression, and the decompressed body size could be larger than the specified limit. While this parameter allows filtering some responses before the response body download begins, not all responses are subject to this limit, as the content-length HTTP header is optional. (default: 512000)
max_response_data_length
→ Maximum allowed HTTP response data size in bytes (after HTTP transfer-encoding decompression). This limit is applied while the response body is being received. When the received response body grows over the specified limit, the response gets interrupted. (default: 1024000)
excluded_resource_types
→ An optional list of resource types which shouldn't be further processed. (default: [ "Font", "Stylesheet" ])
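
The two size limits act at different stages, which the following Python sketch tries to illustrate: `max_response_content_length` is checked against the declared content-length before the download starts, while `max_response_data_length` is enforced as decoded data arrives. This is only a model of the described behavior, not the backend's implementation; the function names are hypothetical, and keeping the truncated partial body on interruption is an assumption.

```python
from typing import Iterable, Optional

MAX_RESPONSE_CONTENT_LENGTH = 512_000  # documented default
MAX_RESPONSE_DATA_LENGTH = 1_024_000   # documented default


def allowed_by_content_length(
    content_length: Optional[int],
    limit: int = MAX_RESPONSE_CONTENT_LENGTH,
) -> bool:
    """Pre-download check on the declared (possibly compressed) body size.

    Responses without a content-length header cannot be filtered at this
    stage, so they are let through.
    """
    return content_length is None or content_length <= limit


def read_with_data_limit(
    chunks: Iterable[bytes],
    limit: int = MAX_RESPONSE_DATA_LENGTH,
) -> tuple[bytes, bool]:
    """Accumulate decoded body chunks and stop once the total grows over limit.

    Returns (data, complete). Assumption: on interruption the sketch keeps
    the first `limit` bytes received so far.
    """
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        if len(received) > limit:
            return bytes(received[:limit]), False
    return bytes(received), True
```

A response with no content-length header passes the first check and is bounded only by the second, which is why both limits are useful together.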