PDF

Supported formats

PDF format 1.x

Description

The backend performs deep analysis of PDF files: it extracts text and other relevant information, and performs OCR when necessary.

Available in Contextal Platform 1.0 and later.

Features

Detailed Inspection

The following data is extracted/obtained by the backend:

document's PDF standard version
PDF document built-in metadata
list of font names used in the document
PDF form type
hashes of embedded page thumbnails
document page/paper dimensions
counters for various types of PDF annotations, links, objects, attachments, cryptographic signatures and bookmarks
rendered document pages
document text obtained from (optional) text objects in a PDF document
document text produced by performing OCR on rendered document pages
text from annotations
image objects from document pages
files attached to the PDF document
cryptographic signatures of the PDF document

Auto-Decryption Support

The backend is capable of performing contextual auto-decryption of encrypted documents.

Symbols

Object

ISSUES → issues were encountered while processing the document
LIMITS_REACHED → one or more limits were reached while processing the document
MAX_ANNOTATIONS_REACHED → reached the limit of annotations in the document
MAX_ATTACHMENTS_REACHED → reached the limit of attachments in the document
NOTEXT → the document doesn't contain plain text
MAX_BOOKMARKS_REACHED → reached the limit of bookmarks in the document
MAX_FONTS_PER_PAGE_REACHED → reached the limit of fonts per page
MAX_LINKS_REACHED → reached the limit of links in the document
MAX_PAGES_REACHED → reached the limit of pages in the document
MAX_OBJECTS_REACHED → reached the limit of objects in the document
MAX_OBJECT_DEPTH_REACHED → reached the limit of object depth
MAX_SIGNATURES_REACHED → reached the limit of signatures in the document
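
The MAX_*_REACHED symbols appear to mirror the max_* configuration options by name. The following is a minimal sketch, assuming that naming correspondence holds (it is inferred from the names on this page, not confirmed by the backend). The helper `limits_hit` is hypothetical and only illustrates how a list of symbols could be mapped back to the options worth reviewing.

```python
def limits_hit(symbols):
    """Return config option names implied by MAX_*_REACHED symbols.

    Assumption: a symbol such as MAX_PAGES_REACHED corresponds to the
    max_pages option. This mapping is inferred from naming only.
    """
    options = []
    for symbol in symbols:
        if symbol.startswith("MAX_") and symbol.endswith("_REACHED"):
            # Strip the "_REACHED" suffix and lowercase the remainder.
            options.append(symbol[: -len("_REACHED")].lower())
    return options


if __name__ == "__main__":
    print(limits_hit(["LIMITS_REACHED", "MAX_PAGES_REACHED", "NOTEXT"]))
```

Symbols that do not follow the MAX_*_REACHED pattern, such as LIMITS_REACHED or NOTEXT, are ignored by this sketch.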

Children

OCR → the document contains text recognized with OCR
TOOBIG → this child object was not extracted as it exceeds the limits
INVALID_SIGNING_DATE → invalid date of a digital signature
MISSING_SIGNING_DATE → missing date of a digital signature
FALLBACK_TO_RAW_IMAGE → a raw image was stored instead of a processed one

Example Metadata

{
   "org": "ctx",
   "object_id": "2e2f885186c698ff368611e326115749855bc30e3fa482ef36abd765fc9af5eb",
   "object_type": "PDF",
   "object_subtype": null,
   "recursion_level": 1,
   "size": 452811,
   "hashes": {
      "sha1": "072cbc49059dc19744a7aa366d284a83531892b4",
      "md5": "762068ccbb12e7a6ae56d5a48cabbf20",
      "sha256": "2e2f885186c698ff368611e326115749855bc30e3fa482ef36abd765fc9af5eb",
      "sha512": "48bf01d1a2835ba79c23a5df0b269fe46ff4e09e67bcab03c56d3738904fe3d928222888bf32f447f054b69f213a244264dea77c6380c37ab9b6863aa4f817e5"
   },
   "ctime": 1713531404.041442,
   "relation_metadata": {},
   "ok": {
      "symbols": [],
      "object_metadata": {
         "_backend_version": "1.0.0",
         "builtin_metadata": {
            "creation_date": {
               "parsed": [ 2018, 346, 15, 24, 32, 0, 1, 0, 0 ],
               "raw": "D:20181212152432+01'00'"
            },
            "creator": "Adobe InDesign CS6 (Macintosh)",
            "producer": "ilovepdf.com"
         },
         "embedded_thumbnails": [],
         "fonts": [
            "AGJFRV+Calibri",
            "AHSTZY+Calibri",
            "AIUPOB+Calibri-Italic",
            "ANWCLU+Calibri-Bold",
            "AYYLTU+Calibri-Bold",
            "BANMMX+Calibri-BoldItalic",
            "MinionPro-Regular"
         ],
         "form_type": "None",
         "number_of_annotations": {
            "errors": 0,
            "link": 0,
            "other": 0,
            "popup": 0,
            "text": 0,
            "total": 0,
            "unsupported": 0,
            "widget": 0,
            "xfa_widget": 0
         },
         "number_of_attachments": {
            "errors": 0,
            "total": 0
         },
         "number_of_bookmarks": {
            "errors": 0,
            "total": 0,
            "with_uris": 0
         },
         "number_of_links": {
            "errors": 0,
            "total": 0,
            "with_action_embedded": 0,
            "with_action_launch": 0,
            "with_action_local": 0,
            "with_action_remote": 0,
            "with_action_unsupported": 0,
            "with_action_uri": 0
         },
         "number_of_objects": {
            "errors": 0,
            "form_xobjects": 6,
            "images": 55,
            "shadings": 0,
            "texts": 814,
            "total": 970,
            "unsupported": 0,
            "vector_paths": 95
         },
         "number_of_pages": 20,
         "number_of_signatures": {
            "errors": 0,
            "total": 0
         },
         "paper_sizes_mm": [
            {
               "height": 210,
               "standard_name": "A5",
               "width": 148
            }
         ],
         "uris": [],
         "version": "1.4"
      },
      "children": [
         {
            "org": "ctx",
            "object_id": "c9ab2da0b8588729c258b5a3c7ae98e60bbed881b01e70509e85a48494df105d",
            "object_type": "Text",
            "object_subtype": null,
            "recursion_level": 2,
            "size": 17671,
[...]
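
The creation_date entry above carries both the raw PDF date string and a parsed array. The sketch below shows one way to read them in Python. It assumes the parsed array is laid out as [year, day of year, hour, minute, second, nanosecond, offset hours, offset minutes, offset seconds]; that layout is inferred from this single example and is not confirmed by the page. The function names are hypothetical, and the raw parser handles only the full "D:YYYYMMDDHHmmSS+HH'mm'" form seen here.

```python
import re
from datetime import datetime, timedelta, timezone

PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})'(\d{2})'?"
)


def parse_raw_pdf_date(raw):
    """Parse a full-form PDF date string such as D:20181212152432+01'00'."""
    match = PDF_DATE.fullmatch(raw)
    if match is None:
        raise ValueError(f"unsupported PDF date: {raw!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign = 1 if match.group(7) == "+" else -1
    offset = sign * timedelta(hours=int(match.group(8)), minutes=int(match.group(9)))
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))


def parsed_to_datetime(parsed):
    """Convert the 'parsed' array, under the assumed field layout."""
    year, ordinal, hour, minute, second, nanos, off_h, off_m, off_s = parsed
    tz = timezone(timedelta(hours=off_h, minutes=off_m, seconds=off_s))
    first_day = datetime(year, 1, 1, hour, minute, second, nanos // 1000, tzinfo=tz)
    # The ordinal is 1-based, so day 1 is January 1.
    return first_day + timedelta(days=ordinal - 1)


if __name__ == "__main__":
    raw = parse_raw_pdf_date("D:20181212152432+01'00'")
    parsed = parsed_to_datetime([2018, 346, 15, 24, 32, 0, 1, 0, 0])
    print(raw.isoformat(), raw == parsed)
```

Under that assumption, both values in the example describe the same instant: 12 December 2018 is day 346 of the year, and the offset is +01:00.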

Example Queries

object_type == "PDF"
    && @has_symbol("NOTEXT")

This query matches a PDF document that doesn't contain plain text.

object_type == "PDF" && @has_child(
    object_type == "Text" && @has_symbol("OCR")
    && @has_child(object_type == "URL"
        && @match_relation_meta($url starts_with("https://"))
        && @count_children() > 20
    )
)

This query matches a PDF from which text was extracted through OCR, where that text contains a URL starting with "https://" and processing that URL produced more than 20 child objects.

Configuration Options

max_processed_size → maximum size of the input object that will be processed (default: 262144000)
max_objects → maximum number of internal objects to be processed (default: 262144)
max_object_depth → maximum depth of the objects (default: 16)
max_pages → maximum number of pages to be processed (default: 2048)
max_bookmarks → maximum number of bookmarks to be processed (default: 8192)
max_annotations → maximum number of annotations to be processed (default: 16384)
max_attachments → maximum number of attachments to be processed (default: 128)
max_attachment_size → maximum attachment size (default: 33554432)
max_fonts_per_page → maximum number of fonts per page (default: 128)
max_links → maximum number of links (default: 2048)
max_signatures → maximum number of embedded signatures (default: 32)
max_signature_size → maximum size of a single signature (default: 262144)
render_pages → whether to render all pages (default: false). Pages may still be rendered in some cases, if the OCR settings require it.
render_page_width → page width for the rendered page (default: 1920)
render_page_height → page height for the rendered page (default: auto)
save_image_objects → whether to save images for further processing (default: true)
output_image_format → format of the output image; valid settings are: "png", "jpg", "bmp", "webp", and "tiff" (default: "png")
ocr_mode → valid OCR modes are: "Never" (disable OCR), "IfNoDocumentTextAvailable" (perform OCR only if the document doesn't contain plain text), and "Always" (always perform OCR) (default: "IfNoDocumentTextAvailable")
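
The three ocr_mode values and their interaction with render_pages can be summarized as a small decision function. This is a sketch of the behavior described above, not the backend's actual implementation. The function names are hypothetical, and the rule that pages are rendered whenever OCR runs is an assumption based on the render_pages note.

```python
def should_run_ocr(ocr_mode, has_document_text):
    """Decide whether OCR would run for a document under a given ocr_mode."""
    if ocr_mode == "Never":
        return False
    if ocr_mode == "Always":
        return True
    if ocr_mode == "IfNoDocumentTextAvailable":
        return not has_document_text
    raise ValueError(f"unknown ocr_mode: {ocr_mode!r}")


def should_render_pages(render_pages, ocr_mode, has_document_text):
    """Assumption: OCR needs rendered pages, so OCR implies rendering."""
    return render_pages or should_run_ocr(ocr_mode, has_document_text)


if __name__ == "__main__":
    # Defaults: render_pages is false, ocr_mode is IfNoDocumentTextAvailable.
    print(should_render_pages(False, "IfNoDocumentTextAvailable", False))
    print(should_render_pages(False, "IfNoDocumentTextAvailable", True))
```

With the default settings, a scanned document without text objects would have its pages rendered for OCR, while a document that already contains text would not.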
