PDF

Supported formats

PDF format 1.x

Description

The backend performs deep analysis of PDF files: it extracts text and other document information, and performs OCR when necessary.

Available in Contextal Platform 1.0 and later.

Features

Detailed Inspection

The following data is extracted/obtained by the backend:

  • document's PDF standard version
  • PDF document built-in metadata
  • list of font names used in the document
  • PDF form type
  • hashes of embedded page thumbnails
  • document page/paper dimensions
  • counters for various types of PDF annotations, links, objects, attachments, cryptographic signatures and bookmarks
  • rendered document pages
  • document text obtained from (optional) text objects in a PDF document
  • document text produced by performing OCR on rendered document pages
  • text from annotations
  • image objects from document pages
  • files attached to the PDF document
  • cryptographic signatures of the PDF document

Auto-Decryption Support

The backend is capable of performing contextual auto-decryption of encrypted documents.

Symbols

Object

  • ISSUES → issues were encountered while processing the document
  • LIMITS_REACHED → one or more limits were reached while processing the document
  • MAX_ANNOTATIONS_REACHED → reached the limit of annotations in the document
  • MAX_ATTACHMENTS_REACHED → reached the limit of attachments in the document
  • NOTEXT → the document contains no plain text
  • MAX_BOOKMARKS_REACHED → reached the limit of bookmarks in the document
  • MAX_FONTS_PER_PAGE_REACHED → reached the limit of fonts per page
  • MAX_LINKS_REACHED → reached the limit of links in the document
  • MAX_PAGES_REACHED → reached the limit of pages in the document
  • MAX_OBJECTS_REACHED → reached the limit of objects in the document
  • MAX_OBJECT_DEPTH_REACHED → reached the limit of object depth
  • MAX_SIGNATURES_REACHED → reached the limit of signatures in the document

Children

  • OCR → the document contains text recognized with OCR
  • TOOBIG → this child object was not extracted because it exceeds the limits
  • INVALID_SIGNING_DATE → the digital signature has an invalid signing date
  • MISSING_SIGNING_DATE → the digital signature has no signing date
  • FALLBACK_TO_RAW_IMAGE → a raw image was stored instead of a processed one

Example Metadata

{
  "org": "ctx",
  "object_id": "2e2f885186c698ff368611e326115749855bc30e3fa482ef36abd765fc9af5eb",
  "object_type": "PDF",
  "object_subtype": null,
  "recursion_level": 1,
  "size": 452811,
  "hashes": {
    "sha1": "072cbc49059dc19744a7aa366d284a83531892b4",
    "md5": "762068ccbb12e7a6ae56d5a48cabbf20",
    "sha256": "2e2f885186c698ff368611e326115749855bc30e3fa482ef36abd765fc9af5eb",
    "sha512": "48bf01d1a2835ba79c23a5df0b269fe46ff4e09e67bcab03c56d3738904fe3d928222888bf32f447f054b69f213a244264dea77c6380c37ab9b6863aa4f817e5"
  },
  "ctime": 1713531404.041442,
  "relation_metadata": {},
  "ok": {
    "symbols": [],
    "object_metadata": {
      "_backend_version": "1.0.0",
      "builtin_metadata": {
        "creation_date": {
          "parsed": [ 2018, 346, 15, 24, 32, 0, 1, 0, 0 ],
          "raw": "D:20181212152432+01'00'"
        },
        "creator": "Adobe InDesign CS6 (Macintosh)",
        "producer": "ilovepdf.com"
      },
      "embedded_thumbnails": [],
      "fonts": [
        "AGJFRV+Calibri",
        "AHSTZY+Calibri",
        "AIUPOB+Calibri-Italic",
        "ANWCLU+Calibri-Bold",
        "AYYLTU+Calibri-Bold",
        "BANMMX+Calibri-BoldItalic",
        "MinionPro-Regular"
      ],
      "form_type": "None",
      "number_of_annotations": {
        "errors": 0,
        "link": 0,
        "other": 0,
        "popup": 0,
        "text": 0,
        "total": 0,
        "unsupported": 0,
        "widget": 0,
        "xfa_widget": 0
      },
      "number_of_attachments": {
        "errors": 0,
        "total": 0
      },
      "number_of_bookmarks": {
        "errors": 0,
        "total": 0,
        "with_uris": 0
      },
      "number_of_links": {
        "errors": 0,
        "total": 0,
        "with_action_embedded": 0,
        "with_action_launch": 0,
        "with_action_local": 0,
        "with_action_remote": 0,
        "with_action_unsupported": 0,
        "with_action_uri": 0
      },
      "number_of_objects": {
        "errors": 0,
        "form_xobjects": 6,
        "images": 55,
        "shadings": 0,
        "texts": 814,
        "total": 970,
        "unsupported": 0,
        "vector_paths": 95
      },
      "number_of_pages": 20,
      "number_of_signatures": {
        "errors": 0,
        "total": 0
      },
      "paper_sizes_mm": [
        {
          "height": 210,
          "standard_name": "A5",
          "width": 148
        }
      ],
      "uris": [],
      "version": "1.4"
    },
    "children": [
      {
        "org": "ctx",
        "object_id": "c9ab2da0b8588729c258b5a3c7ae98e60bbed881b01e70509e85a48494df105d",
        "object_type": "Text",
        "object_subtype": null,
        "recursion_level": 2,
        "size": 17671,
        [...]
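
Two fields in the example metadata benefit from a little post-processing on the consumer side. The Python sketch below works on a trimmed copy of the example: it strips the six-letter subset tags that PDF prepends to embedded font subsets (for example "AGJFRV+"), and it parses the raw creation date string. The helper names are hypothetical, and the date parser handles only the full form with an explicit UTC offset, as seen in the example. The reading of the "parsed" array is an assumption: its second element, 346, matches the day of the year of the raw date, which suggests an ordinal-date encoding.

```python
import re
from datetime import datetime, timedelta, timezone

# Trimmed copy of fields from the example metadata above.
object_metadata = {
    "builtin_metadata": {
        "creation_date": {"raw": "D:20181212152432+01'00'"},
    },
    "fonts": [
        "AGJFRV+Calibri",
        "AHSTZY+Calibri",
        "AIUPOB+Calibri-Italic",
        "MinionPro-Regular",
    ],
}

# A font subset tag is six uppercase letters followed by "+".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")
# Full PDF date form only: D:YYYYMMDDHHmmSS+HH'mm' (trailing apostrophe optional).
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})'(\d{2})'?$"
)


def base_font_names(fonts):
    """Return the sorted, de-duplicated font names without subset tags."""
    return sorted({_SUBSET_TAG.sub("", name) for name in fonts})


def parse_pdf_date(raw):
    """Parse a full-form PDF date string into a timezone-aware datetime."""
    m = _PDF_DATE.match(raw)
    if m is None:
        raise ValueError(f"unsupported PDF date: {raw!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    sign = 1 if m.group(7) == "+" else -1
    offset = sign * timedelta(hours=int(m.group(8)), minutes=int(m.group(9)))
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))


created = parse_pdf_date(object_metadata["builtin_metadata"]["creation_date"]["raw"])
print(base_font_names(object_metadata["fonts"]))
# prints ['Calibri', 'Calibri-Italic', 'MinionPro-Regular']
print(created.isoformat())           # prints 2018-12-12T15:24:32+01:00
print(created.timetuple().tm_yday)   # prints 346
```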

Example Queries

object_type == "PDF"
  && @has_symbol("NOTEXT")
  • This query matches a PDF document that contains no plain text.
object_type == "PDF" && @has_child(
  object_type == "Text" && @has_symbol("OCR")
  && @has_child(object_type == "URL"
    && @match_relation_meta($url starts_with("https://"))
    && @count_children() > 20
  )
)
  • This query matches a PDF document from which text was extracted through OCR, where that text contains a URL starting with "https://" that, once processed, yields more than 20 child objects.

Configuration Options

  • max_processed_size → maximum size of the input object that will be processed (default: 262144000)
  • max_objects → maximum number of internal objects to be processed (default: 262144)
  • max_object_depth → maximum depth of the objects (default: 16)
  • max_pages → maximum number of pages to be processed (default: 2048)
  • max_bookmarks → maximum number of bookmarks to be processed (default: 8192)
  • max_annotations → maximum number of annotations to be processed (default: 16384)
  • max_attachments → maximum number of attachments to be processed (default: 128)
  • max_attachment_size → maximum attachment size (default: 33554432)
  • max_fonts_per_page → maximum number of fonts per page (default: 128)
  • max_links → maximum number of links (default: 2048)
  • max_signatures → maximum number of embedded signatures (default: 32)
  • max_signature_size → maximum size of a single signature (default: 262144)
  • render_pages → whether to render all pages (default: false). Pages may still be rendered when the OCR settings require it.
  • render_page_width → page width for the rendered page (default: 1920)
  • render_page_height → page height for the rendered page (default: auto)
  • save_image_objects → whether to save images for further processing (default: true)
  • output_image_format → format of the output image; valid settings are: "png", "jpg", "bmp", "webp", and "tiff" (default: "png")
  • ocr_mode → valid OCR modes are: "Never" (disable OCR), "IfNoDocumentTextAvailable" (perform OCR only if the document contains no plain text), and "Always" (always perform OCR) (default: "IfNoDocumentTextAvailable")
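
The interaction between ocr_mode and render_pages can be summarized as a small decision function. The Python sketch below is only one reading of the option descriptions above, not the backend's actual implementation; the function names and the has_document_text flag are hypothetical.

```python
def should_run_ocr(ocr_mode, has_document_text):
    """Decide whether OCR runs, following the three documented modes."""
    if ocr_mode == "Never":
        return False
    if ocr_mode == "Always":
        return True
    if ocr_mode == "IfNoDocumentTextAvailable":
        return not has_document_text
    raise ValueError(f"unknown ocr_mode: {ocr_mode!r}")


def pages_get_rendered(render_pages, ocr_mode, has_document_text):
    """Assumption: OCR needs rendered pages, so it can override render_pages=false."""
    return render_pages or should_run_ocr(ocr_mode, has_document_text)


# With the defaults (render_pages=false, ocr_mode="IfNoDocumentTextAvailable"):
print(pages_get_rendered(False, "IfNoDocumentTextAvailable", has_document_text=True))
# prints False
print(pages_get_rendered(False, "IfNoDocumentTextAvailable", has_document_text=False))
# prints True
```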