Supported formats
PDF format 1.x
Description
The backend performs a deep analysis of PDF files, extracting text and other important information, performing OCR if necessary.
Info: Available in Contextal Platform 1.0 and later.
Features
Detailed Inspection
The following data is extracted/obtained by the backend:
- document's PDF standard version
- PDF document built-in metadata
- list of font names used in the document
- PDF form type
- hashes of embedded page thumbnails
- document page/paper dimensions
- counters for various types of PDF annotations, links, objects, attachments, cryptographic signatures and bookmarks
- rendered document pages
- document text obtained from (optional) text objects in a PDF document
- document text produced by performing OCR on rendered document pages
- text from annotations
- image objects from document pages
- files attached to the PDF document
- cryptographic signatures of the PDF document
Auto-Decryption Support
The backend is capable of performing contextual auto-decryption of encrypted documents.
Symbols
Object
ISSUES
→ issues were encountered while processing the document
LIMITS_REACHED
→ limits were triggered while processing the document
MAX_ANNOTATIONS_REACHED
→ reached the limit of annotations in the document
MAX_ATTACHMENTS_REACHED
→ reached the limit of attachments in the document
NOTEXT
→ the document doesn't contain plain text
MAX_BOOKMARKS_REACHED
→ reached the limit of bookmarks in the document
MAX_FONTS_PER_PAGE_REACHED
→ reached the limit of fonts per page
MAX_LINKS_REACHED
→ reached the limit of links in the document
MAX_PAGES_REACHED
→ reached the limit of pages in the document
MAX_OBJECTS_REACHED
→ reached the limit of objects in the document
MAX_OBJECT_DEPTH_REACHED
→ reached the limit of object depth
MAX_SIGNATURES_REACHED
→ reached the limit of signatures in the document
Children
OCR
→ the document contains text recognized with OCR
TOOBIG
→ this child object was not extracted because it exceeds the limits
INVALID_SIGNING_DATE
→ invalid date of a digital signature
MISSING_SIGNING_DATE
→ missing date of a digital signature
FALLBACK_TO_RAW_IMAGE
→ a raw image was stored instead of a processed one
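Most of the MAX_*_REACHED symbols share their name with a configuration option listed under Configuration Options. The sketch below pairs them up so that a list of symbols can be turned into a list of options worth reviewing. The pairing is an assumption inferred from the matching names, and the helper options_to_review is hypothetical; neither is part of the platform.

```python
# Assumed pairing of limit symbols with configuration options, inferred from
# their matching names. The platform itself may not expose such a mapping.
LIMIT_SYMBOL_TO_OPTION = {
    "MAX_ANNOTATIONS_REACHED": "max_annotations",
    "MAX_ATTACHMENTS_REACHED": "max_attachments",
    "MAX_BOOKMARKS_REACHED": "max_bookmarks",
    "MAX_FONTS_PER_PAGE_REACHED": "max_fonts_per_page",
    "MAX_LINKS_REACHED": "max_links",
    "MAX_PAGES_REACHED": "max_pages",
    "MAX_OBJECTS_REACHED": "max_objects",
    "MAX_OBJECT_DEPTH_REACHED": "max_object_depth",
    "MAX_SIGNATURES_REACHED": "max_signatures",
}


def options_to_review(symbols):
    """Return the options related to the limit symbols present, in input order."""
    return [
        LIMIT_SYMBOL_TO_OPTION[symbol]
        for symbol in symbols
        if symbol in LIMIT_SYMBOL_TO_OPTION
    ]


if __name__ == "__main__":
    # Symbols without a matching option (LIMITS_REACHED, NOTEXT) are skipped.
    print(options_to_review(["LIMITS_REACHED", "MAX_PAGES_REACHED", "NOTEXT"]))
```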
Example Metadata
{
  "org": "ctx",
  "object_id": "2e2f885186c698ff368611e326115749855bc30e3fa482ef36abd765fc9af5eb",
  "object_type": "PDF",
  "object_subtype": null,
  "recursion_level": 1,
  "size": 452811,
  "hashes": {
    "sha1": "072cbc49059dc19744a7aa366d284a83531892b4",
    "md5": "762068ccbb12e7a6ae56d5a48cabbf20",
    "sha256": "2e2f885186c698ff368611e326115749855bc30e3fa482ef36abd765fc9af5eb",
    "sha512": "48bf01d1a2835ba79c23a5df0b269fe46ff4e09e67bcab03c56d3738904fe3d928222888bf32f447f054b69f213a244264dea77c6380c37ab9b6863aa4f817e5"
  },
  "ctime": 1713531404.041442,
  "relation_metadata": {},
  "ok": {
    "symbols": [],
    "object_metadata": {
      "_backend_version": "1.0.0",
      "builtin_metadata": {
        "creation_date": {
          "parsed": [ 2018, 346, 15, 24, 32, 0, 1, 0, 0 ],
          "raw": "D:20181212152432+01'00'"
        },
        "creator": "Adobe InDesign CS6 (Macintosh)",
        "producer": "ilovepdf.com"
      },
      "embedded_thumbnails": [],
      "fonts": [
        "AGJFRV+Calibri",
        "AHSTZY+Calibri",
        "AIUPOB+Calibri-Italic",
        "ANWCLU+Calibri-Bold",
        "AYYLTU+Calibri-Bold",
        "BANMMX+Calibri-BoldItalic",
        "MinionPro-Regular"
      ],
      "form_type": "None",
      "number_of_annotations": {
        "errors": 0,
        "link": 0,
        "other": 0,
        "popup": 0,
        "text": 0,
        "total": 0,
        "unsupported": 0,
        "widget": 0,
        "xfa_widget": 0
      },
      "number_of_attachments": {
        "errors": 0,
        "total": 0
      },
      "number_of_bookmarks": {
        "errors": 0,
        "total": 0,
        "with_uris": 0
      },
      "number_of_links": {
        "errors": 0,
        "total": 0,
        "with_action_embedded": 0,
        "with_action_launch": 0,
        "with_action_local": 0,
        "with_action_remote": 0,
        "with_action_unsupported": 0,
        "with_action_uri": 0
      },
      "number_of_objects": {
        "errors": 0,
        "form_xobjects": 6,
        "images": 55,
        "shadings": 0,
        "texts": 814,
        "total": 970,
        "unsupported": 0,
        "vector_paths": 95
      },
      "number_of_pages": 20,
      "number_of_signatures": {
        "errors": 0,
        "total": 0
      },
      "paper_sizes_mm": [
        {
          "height": 210,
          "standard_name": "A5",
          "width": 148
        }
      ],
      "uris": [],
      "version": "1.4"
    },
    "children": [
      {
        "org": "ctx",
        "object_id": "c9ab2da0b8588729c258b5a3c7ae98e60bbed881b01e70509e85a48494df105d",
        "object_type": "Text",
        "object_subtype": null,
        "recursion_level": 2,
        "size": 17671,
        [...]
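Two fields in the example above benefit from light post-processing. The sketch below is a minimal illustration with hypothetical helper names (decode_parsed_date, base_font_names). It assumes the "parsed" array is laid out as [year, day of year, hour, minute, second, nanosecond, UTC offset hours, offset minutes, offset seconds]; that layout is inferred from the example, where day 346 of 2018 is 12 December and matches the "raw" value, and the source does not confirm it. The font helper strips the subset tag (six uppercase letters followed by "+") that PDF producers prepend to the names of subsetted fonts.

```python
import re
from datetime import datetime, timedelta, timezone

# Subset tag that PDF producers prepend to subsetted font names, e.g. "AGJFRV+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def decode_parsed_date(parsed):
    """Convert the 9-element "parsed" array into a timezone-aware datetime.

    Assumed layout (inferred from the example, not confirmed by the source):
    [year, day_of_year, hour, minute, second, nanosecond,
     offset_hours, offset_minutes, offset_seconds]
    """
    year, ordinal, hour, minute, second, nanos, off_h, off_m, off_s = parsed
    tz = timezone(timedelta(hours=off_h, minutes=off_m, seconds=off_s))
    start_of_year = datetime(year, 1, 1, tzinfo=tz)
    return start_of_year + timedelta(
        days=ordinal - 1,
        hours=hour,
        minutes=minute,
        seconds=second,
        microseconds=nanos // 1000,
    )


def base_font_names(fonts):
    """Return the sorted, de-duplicated font names with subset tags removed."""
    return sorted({SUBSET_TAG.sub("", name) for name in fonts})


if __name__ == "__main__":
    print(decode_parsed_date([2018, 346, 15, 24, 32, 0, 1, 0, 0]).isoformat())
    print(base_font_names(["AGJFRV+Calibri", "AHSTZY+Calibri", "MinionPro-Regular"]))
```

Collapsing subset tags makes it easier to compare font lists across documents, since the tag is arbitrary and differs from file to file.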
Example Queries
object_type == "PDF"
  && @has_symbol("NOTEXT")
- This query matches a PDF document that doesn't contain plain text.
object_type == "PDF" && @has_child(
  object_type == "Text" && @has_symbol("OCR")
  && @has_child(object_type == "URL"
    && @match_relation_meta($url starts_with("https://"))
    && @count_children() > 20
  )
)
- This query matches a PDF document from which text was extracted through OCR, where that text contains a URL starting with "https://" which, once processed, yields more than 20 child objects.
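For readers who post-process exported metadata outside the platform, the first query can be approximated client-side. The sketch below is not the platform's query engine; it assumes only the JSON shape shown in Example Metadata (symbols stored under "ok" → "symbols"), and the function name is_pdf_without_text is hypothetical.

```python
def is_pdf_without_text(obj):
    """Approximate `object_type == "PDF" && @has_symbol("NOTEXT")` on a metadata dict.

    Assumes symbols are stored under obj["ok"]["symbols"], as in the example
    metadata; this is a client-side sketch, not the platform's query engine.
    """
    symbols = (obj.get("ok") or {}).get("symbols", [])
    return obj.get("object_type") == "PDF" and "NOTEXT" in symbols


if __name__ == "__main__":
    sample = {"object_type": "PDF", "ok": {"symbols": ["NOTEXT"]}}
    print(is_pdf_without_text(sample))
```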
Configuration Options
max_processed_size
→ maximum size of the input object that will be processed (default: 262144000)
max_objects
→ maximum number of internal objects to be processed (default: 262144)
max_object_depth
→ maximum depth of the objects (default: 16)
max_pages
→ maximum number of pages to be processed (default: 2048)
max_bookmarks
→ maximum number of bookmarks to be processed (default: 8192)
max_annotations
→ maximum number of annotations to be processed (default: 16384)
max_attachments
→ maximum number of attachments to be processed (default: 128)
max_attachment_size
→ maximum attachment size (default: 33554432)
max_fonts_per_page
→ maximum number of fonts per page (default: 128)
max_links
→ maximum number of links (default: 2048)
max_signatures
→ maximum number of embedded signatures (default: 32)
max_signature_size
→ maximum size of a single signature (default: 262144)
render_pages
→ whether to render all pages (default: false). Pages may still be rendered in some cases if the OCR settings require it.
render_page_width
→ page width for the rendered page (default: 1920)
render_page_height
→ page height for the rendered page (default: auto)
save_image_objects
→ whether to save images for further processing (default: true)
output_image_format
→ format of the output image; valid settings are: "png", "jpg", "bmp", "webp", and "tiff" (default: "png")
ocr_mode
→ valid OCR modes are: "Never" (disable OCR), "IfNoDocumentTextAvailable" (perform OCR if the document doesn't contain plain text), and "Always" (always perform OCR) (default: "IfNoDocumentTextAvailable")
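The defaults and allowed values above can be captured in a small validation helper, for example to sanity-check overrides before deploying them. This is a sketch only: the dictionary is not the platform's configuration file format, merge_options is a hypothetical helper, and render_page_height is left out because the source does not say how its "auto" default is represented.

```python
VALID_OCR_MODES = {"Never", "IfNoDocumentTextAvailable", "Always"}
VALID_IMAGE_FORMATS = {"png", "jpg", "bmp", "webp", "tiff"}

# Defaults copied from the option list above. render_page_height is omitted
# because the representation of its "auto" default is not documented here.
DEFAULTS = {
    "max_processed_size": 262144000,
    "max_objects": 262144,
    "max_object_depth": 16,
    "max_pages": 2048,
    "max_bookmarks": 8192,
    "max_annotations": 16384,
    "max_attachments": 128,
    "max_attachment_size": 33554432,
    "max_fonts_per_page": 128,
    "max_links": 2048,
    "max_signatures": 32,
    "max_signature_size": 262144,
    "render_pages": False,
    "render_page_width": 1920,
    "save_image_objects": True,
    "output_image_format": "png",
    "ocr_mode": "IfNoDocumentTextAvailable",
}


def merge_options(overrides):
    """Merge overrides onto the defaults and reject obviously invalid values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    merged = {**DEFAULTS, **overrides}
    if merged["ocr_mode"] not in VALID_OCR_MODES:
        raise ValueError(f"invalid ocr_mode: {merged['ocr_mode']!r}")
    if merged["output_image_format"] not in VALID_IMAGE_FORMATS:
        raise ValueError(f"invalid output_image_format: {merged['output_image_format']!r}")
    for name, value in merged.items():
        # Treat every max_* option as a positive integer limit (bool is excluded
        # explicitly because it is a subclass of int in Python).
        if name.startswith("max_"):
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer")
    return merged


if __name__ == "__main__":
    options = merge_options({"ocr_mode": "Always", "max_pages": 100})
    print(options["ocr_mode"], options["max_pages"])
```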