Tutorial

Optimizing PDF extraction performance with Sensible

Frances Elliott
Monday, August 23, 2021

A previous tutorial covered extracting structured data from documents using SenseML. Let's switch gears to optimizing your data extraction. 

What we'll cover

  • What impacts Sensible performance?
  • Rewrite SenseML queries for faster performance
  • Preferentially run or skip collections of queries ("configs") based on key text in documents

What impacts Sensible performance?

First, let's clarify what doesn't impact performance: the number of documents you submit has virtually no effect on processing time. Each document gets its own worker in parallel, whether you submit one or 50,000 documents. Instead, you can optimize:

  • document performance
  • document type performance

Document performance

In an ideal performance scenario, you extract data from digitally generated PDFs using only text-based or coordinate-based SenseML methods, such as Label, Row, Region, Text Table, and Document Range.

In the real world, things are never that simple. From slowest to quickest, these factors add seconds to document processing:

Over 10 seconds per document 

Whole-document OCR (for scanned documents)  

Sensible takes 10 seconds or more to OCR an entire document. You can speed OCR up for shorter documents (5 pages or fewer) by choosing Sensible's Google OCR option.

Whole-document table recognition

Avoid configuring Sensible to search a whole document for tables. For a tutorial, see the "Best practice: table stops" section later in this post.

Under 5 seconds per document

Selective OCR 

Some documents mix digital text with text images, for example by embedding scanned pages in a digital PDF. Speed this up by OCRing select pages, not the whole document. For more information, see the docs.

Selective table recognition

Sensible processes tables that include a stop in less than 5 seconds. Alternatively, convert to a faster method that skips table recognition. For a tutorial, see the "Best practice: table stops" and "Convert to a faster query" sections later in this post.

Under 1 second per document

Some SenseML methods use pixels, for example to recognize borders. However, pixel recognition requires rendering a PDF page, which can take a couple hundred milliseconds. To improve processing time, use coordinate-based alternatives to these methods. 

Boxes

To improve processing speed, convert the more flexible Box method to the strictly coordinate-based Region method.

Signature, checkbox, image coordinate extraction

There are no alternative methods for signatures, checkboxes, and images. However, see the following section for ways to avoid running these methods except when absolutely necessary. 

Document type performance

By default, Sensible runs all the configs in a document type before choosing the best one for a given document. If your document type contains many different configs with computationally expensive methods such as Table or Box, you can improve performance by selectively running and skipping configs. For a tutorial, see the "Selectively skip queries" section later in this post.

Enough overview! Let's dive into some real-world optimizing.

Prerequisites

  • You’ll need an account for Sensible.  Or, read along for a rough idea of how things work.

  {
    "fields": [
      {
        "id": "loss_history",
        "anchor": "enter all claims or losses",
        "type": "table",
        "method": {
          "id": "fixedTable",
          "columnCount": 8,
          "columns": [
            {
              "id": "date_of_occurence",
              "type": "date",
              "index": 0,
              "isRequired": true
            },
            {
              "id": "line",
              "index": 1
            },
            {
              "id": "description",
              "index": 2
            },
            {
              "id": "date_of_claim",
              "type": "date",
              "index": 3
            },
            {
              "id": "amount_paid",
              "type": "currency",
              "index": 4
            },
            {
              "id": "amount_reserved",
              "type": "currency",
              "index": 5
            },
            {
              "id": "claim_status_open",
              "index": 6
            },
            {
              "id": "claim_status_closed",
              "index": 7
            }
          ]
        }
      }
    ]
  }

You should see that Sensible recognizes the table (green box):

And you should see the following data extracted from the "loss history" table in the output pane:


{
  "loss_history": {
    "columns": [
      {
        "id": "date_of_occurence",
        "values": [
          {
            "source": "10/16/2020",
            "value": "2020-10-16T00:00:00.000Z",
            "type": "date"
          },
          {
            "source": "07/12/2019",
            "value": "2019-07-12T00:00:00.000Z",
            "type": "date"
          }
        ]
      },
      {
        "id": "line",
        "values": [
          {
            "value": "PROP",
            "type": "string"
          },
          {
            "value": "PROP",
            "type": "string"
          }
        ]
      },
      {
        "id": "description",
        "values": [
          {
            "value": "Fire damage, 2020.",
            "type": "string"
          },
          {
            "value": "Burglary loss, 2019",
            "type": "string"
          }
        ]
      },
      {
        "id": "date_of_claim",
        "values": [
          {
            "source": "10/17/2020",
            "value": "2020-10-17T00:00:00.000Z",
            "type": "date"
          },
          {
            "source": "07/13/2019",
            "value": "2019-07-13T00:00:00.000Z",
            "type": "date"
          }
        ]
      },
      {
        "id": "amount_paid",
        "values": [
          {
            "source": "$ 10,000",
            "value": 10000,
            "unit": "$",
            "type": "currency"
          },
          {
            "source": "$ 5,000",
            "value": 5000,
            "unit": "$",
            "type": "currency"
          }
        ]
      },
      {
        "id": "amount_reserved",
        "values": [
          null,
          null
        ]
      },
      {
        "id": "claim_status_open",
        "values": [
          {
            "value": "",
            "type": "string"
          },
          {
            "value": "",
            "type": "string"
          }
        ]
      },
      {
        "id": "claim_status_closed",
        "values": [
          {
            "value": "N",
            "type": "string"
          },
          {
            "value": "N",
            "type": "string"
          }
        ]
      }
    ]
  }
}
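The output above is organized by column, while downstream code often wants one record per table row. The following is a minimal sketch of that transposition, assuming you have already loaded the output as a Python dictionary. The `columns_to_rows` helper is a hypothetical name for illustration and not part of Sensible, and the sample is trimmed to three columns of the output shown above.

```python
import json

# A trimmed copy of the output shown above (three columns only, for brevity).
output = json.loads("""
{
  "loss_history": {
    "columns": [
      {"id": "date_of_occurence",
       "values": [
         {"source": "10/16/2020", "value": "2020-10-16T00:00:00.000Z", "type": "date"},
         {"source": "07/12/2019", "value": "2019-07-12T00:00:00.000Z", "type": "date"}
       ]},
      {"id": "amount_paid",
       "values": [
         {"source": "$ 10,000", "value": 10000, "unit": "$", "type": "currency"},
         {"source": "$ 5,000", "value": 5000, "unit": "$", "type": "currency"}
       ]},
      {"id": "amount_reserved", "values": [null, null]}
    ]
  }
}
""")


def columns_to_rows(table):
    """Transpose a column-oriented table into a list of row dictionaries.

    Hypothetical helper: each row maps a column id to the cell's "value",
    or to None when the cell itself is null.
    """
    columns = table["columns"]
    row_count = max((len(col["values"]) for col in columns), default=0)
    rows = []
    for i in range(row_count):
        row = {}
        for col in columns:
            cell = col["values"][i] if i < len(col["values"]) else None
            row[col["id"]] = None if cell is None else cell["value"]
        rows.append(row)
    return rows


rows = columns_to_rows(output["loss_history"])
for row in rows:
    print(row)
```

Null cells, such as the empty "amount_reserved" column, come through as `None`, so you can filter or default them as needed.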


Optimize query speed

The Fixed Table method in the preceding example is convenient when the table column layout never varies. But for long documents, the defaults for Fixed Table can result in slower performance. So let's optimize it!

Best practice: table stops

If you don't define the end of the table, Sensible runs table recognition on all the pages in the document. This impacts performance for long documents.

To define the table stop, append the following Stop parameter after your column array:

"stop": {
  "text": "signature",
  "type": "startsWith"
}

This tells Sensible to stop table recognition as soon as it encounters a line that starts with the text "signature".
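To make the placement concrete, here is a small sketch that adds the Stop parameter to an abbreviated copy of the earlier Fixed Table field and prints the resulting config. It assumes the stop sits inside the method object as a sibling of the column array, which matches the Text Table sample later in this post. Only two of the eight columns are kept for brevity, so treat this as an illustration of structure rather than a complete config.

```python
import json

# Abbreviated copy of the Fixed Table field from the earlier config
# (only two of the eight columns are kept, for brevity).
field = {
    "id": "loss_history",
    "anchor": "enter all claims or losses",
    "type": "table",
    "method": {
        "id": "fixedTable",
        "columnCount": 8,
        "columns": [
            {"id": "date_of_occurence", "type": "date", "index": 0, "isRequired": True},
            {"id": "line", "index": 1},
        ],
    },
}

# Assumption: the Stop parameter is a sibling of "columns" inside "method".
field["method"]["stop"] = {"text": "signature", "type": "startsWith"}

config = {"fields": [field]}
config_json = json.dumps(config, indent=2)
print(config_json)
```

Round-tripping the config through `json.dumps` is also a quick way to catch syntax slips, such as a trailing comma, before you paste the config into the Sensible app.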

Convert to a faster query

For faster extraction, skip table recognition by replacing the Fixed Table method with the Text Table method.

Replace your "acord_125_test" config with the following SenseML:


{
  "fields": [
    {
      "id": "loss_history",
      "type": "table",
      "anchor": "loss history",
      "method": {
        "id": "textTable",
        "columns": [
          {
            "id": "date_of_occurrence",
            "type": "date",
            "minX": 0.4,
            "maxX": 1.2,
            "isRequired": true
          },
          {
            "id": "line",
            "minX": 1.2,
            "maxX": 1.8
          },
          {
            "id": "description",
            "minX": 1.8,
            "maxX": 4.4
          },
          {
            "id": "date_of_claim",
            "type": "date",
            "minX": 4.4,
            "maxX": 5.15
          },
          {
            "id": "amount_paid",
            "type": "currency",
            "minX": 5.15,
            "maxX": 6.25
          },
          {
            "id": "amount_reserved",
            "type": "currency",
            "minX": 6.25,
            "maxX": 7.3
          },
          {
            "id": "subrogation",
            "minX": 7.3,
            "maxX": 7.7
          },
          {
            "id": "claim_status",
            "minX": 7.7,
            "maxX": 8.1
          }
        ],
        "stop": {
          "text": "signature",
          "type": "startsWith"
        }
      }
    }
  ]
}

As the preceding code sample shows, the Text Table method defines columns using coordinates. To determine these coordinates, click the table heading in the Sensible app to display the heading line coordinates.
Note that the strict Text Table method can stumble on a couple of overmerged lines in this table; a Split Lines preprocessor addresses this.
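To build intuition for why coordinate-based columns are fast, consider the following simplified sketch. It assigns each piece of text to a column purely by comparing an x-coordinate against the minX and maxX boundaries from the config above, with no page rendering or table recognition involved. This is an illustration of the idea, not Sensible's implementation: the `column_for` helper, the sample x-coordinates, and the choice to treat minX as inclusive and maxX as exclusive are all assumptions.

```python
# Column boundaries copied from the Text Table config above.
COLUMNS = [
    ("date_of_occurrence", 0.4, 1.2),
    ("line", 1.2, 1.8),
    ("description", 1.8, 4.4),
    ("date_of_claim", 4.4, 5.15),
    ("amount_paid", 5.15, 6.25),
    ("amount_reserved", 6.25, 7.3),
    ("subrogation", 7.3, 7.7),
    ("claim_status", 7.7, 8.1),
]


def column_for(x):
    """Return the id of the column whose [minX, maxX) range contains x, or None.

    Hypothetical helper: the inclusive/exclusive boundary choice is an assumption.
    """
    for column_id, min_x, max_x in COLUMNS:
        if min_x <= x < max_x:
            return column_id
    return None


# Hypothetical left-edge x-coordinates for the text in one table row.
row_text = [
    (0.45, "10/16/2020"),
    (1.25, "PROP"),
    (1.9, "Fire damage, 2020."),
    (5.3, "$ 10,000"),
]

row = {column_for(x): text for x, text in row_text}
print(row)
```

Because each lookup is a handful of numeric comparisons, this style of extraction stays cheap even on long documents. The trade-off is that the boundaries must match the document layout exactly.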

Selectively skip queries

Imagine you process more than one type of application under an "acord_application_test" document type. For example, you process:

  • ACORD 125 - Commercial Insurance Application
  • ACORD 130 - Workers Compensation Application

In this situation, Sensible automatically chooses the best config for each document, so you don't have to specify "acord_125" or "acord_130" in your API calls. Behind the scenes, Sensible runs all the configs in the "acord_application_test" document type and chooses the output with the highest percentage of non-null values. 
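The selection rule described above can be sketched in a few lines. This is a simplified model that assumes each config's output is a flat mapping of field ids to values, with None for fields that found nothing. The function names and the sample field ids are hypothetical, and Sensible's actual scoring may differ in its details.

```python
def non_null_fraction(parsed):
    """Fraction of fields in one config's output that are not None."""
    if not parsed:
        return 0.0
    filled = sum(1 for value in parsed.values() if value is not None)
    return filled / len(parsed)


def choose_best_config(outputs):
    """Return the name of the config whose output has the highest non-null fraction."""
    return max(outputs, key=lambda name: non_null_fraction(outputs[name]))


# Hypothetical outputs from running both configs against one ACORD 125 document.
outputs = {
    "acord_125": {
        "loss_history": {"columns": []},
        "applicant_name": "Example Co",
        "policy_number": None,
    },
    "acord_130": {
        "employer_name": None,
        "class_codes": None,
        "policy_number": None,
    },
}

best = choose_best_config(outputs)
print(best)
```

The sketch makes the cost visible: every config has to run to completion before the comparison can happen, which is what the fingerprint approach below avoids.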

But what if you don't want Sensible to run all the configs? It might slow you down, especially if your configs contain computationally expensive methods like Table and Box.

In that case, add a fingerprint to a config before the Fields array. A fingerprint tests whether a document contains matching text, and Sensible uses the result to decide whether to run or skip the config. For example, you can test that a document is an ACORD 125 with the following fingerprint:

 
  "fingerprint": {
    "tests": [
      {
        "type": "equals",
        "text": "2013/01"
      },
      {
        "type": "equals",
        "text": "acord 125"
      }
    ]
  },


With this fingerprint, Sensible preferentially:

  • runs the "acord_125_test" config on a document if it finds at least 50% of the strings defined in the fingerprint tests
  • skips configs with no fingerprint
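The 50% rule above can be modeled as a simple pass-rate check. The sketch below is a rough approximation and not Sensible's implementation: it assumes a document is available as a list of text lines, that an "equals" test passes when some line matches the test text exactly after trimming and lowercasing, and that the threshold is inclusive. The helper names and sample lines are hypothetical.

```python
# Fingerprint tests copied from the config fragment above.
TESTS = [
    {"type": "equals", "text": "2013/01"},
    {"type": "equals", "text": "acord 125"},
]


def fingerprint_score(lines, tests):
    """Return the fraction of "equals" tests matched by at least one line.

    Assumption: matching ignores case and surrounding whitespace.
    """
    if not tests:
        return 0.0
    normalized = {line.strip().lower() for line in lines}
    passed = sum(
        1
        for test in tests
        if test["type"] == "equals" and test["text"].lower() in normalized
    )
    return passed / len(tests)


def should_run(lines, tests, threshold=0.5):
    """Run the config only if at least `threshold` of the tests pass."""
    return fingerprint_score(lines, tests) >= threshold


# Hypothetical lines from two documents.
acord_125_lines = ["ACORD 125", "COMMERCIAL INSURANCE APPLICATION"]
acord_130_lines = ["ACORD 130", "WORKERS COMPENSATION APPLICATION"]

print(should_run(acord_125_lines, TESTS))
print(should_run(acord_130_lines, TESTS))
```

In this model, the first document matches one of the two tests and meets the 50% threshold, while the second matches none, so its expensive Table or Box queries never need to run.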

Get optimizing!

Congratulations, you've learned about some key methods for optimizing your PDF extractions. Check out our docs, and sign up for a Sensible trial to start extracting from your own documents.
