How to extract data from closing disclosures

Frances Elliott
Thursday, October 13, 2022

Use SenseML to extract structured data from a mortgage loan closing disclosure PDF.

If you’re building software in proptech, chances are you’ve come across the closing disclosure. A closing disclosure contains the final details of a home buyer’s mortgage, such as loan terms, projected monthly payments, and closing costs.

The primary reason to pull data from a closing disclosure is to track changes during the closing process so that all parties stay up to date. Beyond that, closing disclosure data can be used in aggregate to build a more accurate and complete picture of the mortgage market. For example, by knowing the details of recently closed mortgages, companies involved in the home-buying process can help potential buyers predict how much house they can afford and evaluate whether a particular lender’s terms are favorable compared to the market.

The information in a closing disclosure isn’t always easy to access, and it usually isn’t available through an API.

Fortunately, with Sensible you can extract key information from closing disclosure PDFs using SenseML, Sensible’s query language for extracting data from documents. We’ve written a library of open-source SenseML configurations, so you don’t need to write queries from scratch for common documents. From there, your closing disclosure data is accessible via API, Sensible’s UI, or 5,000 other software integrations through Zapier.


What we'll cover

This blog post briefly walks you through configuring extractions for closing disclosures. By the end, you’ll know a couple of SenseML methods and you’ll be on your way to extracting any data you choose using our documentation or our prebuilt open-source closing disclosure configurations.

Write document extraction queries with SenseML

Let's walk through extracting specific pieces of data from a mortgage closing disclosure. Here's an example of a closing disclosure PDF with redacted data:

Mortgage closing disclosure


To follow along, you can sign up for a Sensible account, then either download an example PDF and upload it to the Sensible app, or import the PDF and the prebuilt open-source closing disclosure configurations directly into the Sensible app.

Our configuration for closing disclosure extractions is comprehensive for this PDF, but for the example in this post, let's keep it simple. We'll extract just the:

  • Date issued
  • Loan type
  • Estimated escrow
  • Table of borrowers' transactions

Extract date issued

See the following screenshot for an overview of how to extract the date issued:

Extract date issued (left pane: query. middle pane: document. right pane: output)

The query in the left pane in the preceding image treats the string "date issued" as the first cell in a row, and searches to the right of it for the date. The PDF is displayed in the middle pane, and the extracted date (2021-09-14) is in the right pane.

To try this out yourself, paste the following query, or "field", into the left pane of the Sensible app:


{
  /* SenseML supports code comments using JSON5 */
  "preprocessors": [
    {
      /* correct oversplit lines
      see https://docs.sensible.so/docs/merge-lines */
      "type": "mergeLines",
      "directlyAdjacentThreshold": 0.16,
      "adjacentThreshold": 0.8,
      "yOverlapThreshold": 0.7
    }
  ],
  "fields": [
    {
      /* ID for target data */
      "id": "closing_information.date_issued",
      /* target data is a date, else return null */
      "type": "date",
      /* search for target data 
      near text "date issued" in doc*/
      "anchor": {
        "match": {
          "text": "date issued",
          "type": "startsWith"
        }
      },
      "method": {
        /* target to extract is in a row
           see https://docs.sensible.so/docs/row */
        "id": "row",
        /* target is to right of anchor 
        ("date issued") in row */
        "position": "right",
        /* grab 1st row cell (right of anchor) */
        "tiebreaker": "first"
      }
    }
  ]
}

You'll get this output:


{
  "closing_information.date_issued": {
    "source": "9/14/2021",
    "value": "2021-09-14T00:00:00.000Z",
    "type": "date"
  }
}
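
If you consume this output in your own code, you'll probably want a native date rather than a string. The following Python sketch shows one way to do that. It assumes you've already retrieved the output as a JSON string; the variable names are placeholders for illustration, not part of any Sensible SDK.

```python
import json
from datetime import datetime

# The output shown above, pasted in as a JSON string for illustration.
raw_output = """
{
  "closing_information.date_issued": {
    "source": "9/14/2021",
    "value": "2021-09-14T00:00:00.000Z",
    "type": "date"
  }
}
"""

output = json.loads(raw_output)
field = output["closing_information.date_issued"]

# Older Python versions don't accept a trailing "Z" in fromisoformat(),
# so swap it for an explicit UTC offset before parsing.
issued = datetime.fromisoformat(field["value"].replace("Z", "+00:00"))

print(issued.date().isoformat())  # prints 2021-09-14
```

The "source" key keeps the text exactly as it appears in the PDF, so you can log it alongside the parsed value when you need to audit an extraction.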

Extract loan type

See the following screenshot for an overview of how to extract the loan type:

Extract loan type

The query in the left pane in the preceding image looks for a checkbox near the text "loan type", starting 0.2 inches from the left side of its bounding box. It returns its selection status as true or false.

What if you want to examine all the loan type checkboxes and return only the selected choice? To try it out, paste the following query, or "field", into the left pane of the Sensible app:


{
  "fields": [
    {
      "id": "transaction_information.loan_type.conventional",
      "method": {
        /* target data is a true/false checkbox.
           look for the nearest checkbox, starting 0.2"
           from the anchor's left boundary (orange-outlined box) */
        "id": "nearestCheckbox",
        "position": "left",
        "offsetX": 0.2
      },
      "anchor": {
        /* target data is near text "Loan Type" */
        "match": {
          "text": "Loan Type",
          "type": "startsWith"
        }
      }
    },
    /* field extracts true/false checkbox for VA loans */
    {
      "id": "transaction_information.loan_type.va",
      "method": {
        "id": "nearestCheckbox",
        "position": "left"
      },
      "anchor": {
        "start": {
          "text": "Loan Type",
          "type": "startsWith"
        },
        "match": {
          "text": "VA",
          "type": "includes",
          "isCaseSensitive": true
        }
      }
    },
    /* field extracts true/false checkbox for FHA loans */
    {
      "id": "transaction_information.loan_type.fha",
      "method": {
        "id": "nearestCheckbox",
        "position": "right"
      },
      "anchor": {
        "match": {
          "text": "FHA",
          "type": "includes"
        }
      }
    },
  ],
  "computed_fields": [
    {
      /* to clean up output, return the single
         "true" checkbox value among 3
         checkboxes  */
      "id": "selected_loan_type",
      "method": {
        "id": "pickValues",
        "match": "one",
        "source_ids": [
          "transaction_information.loan_type.conventional",
          "transaction_information.loan_type.fha",
          "transaction_information.loan_type.va"
        ]
      }
    },
    {
      "id": "hide_fields",
      "method": {
        /* to clean up output, suppress the
           source selection statuses */
        "id": "suppressOutput",
        "source_ids": [
          "transaction_information.loan_type.conventional",
          "transaction_information.loan_type.fha",
          "transaction_information.loan_type.va"
        ]
      }
    }
  ]
}

The query outputs:


{
  "selected_loan_type": {
    "value": "transaction_information.loan_type.conventional",
    "type": "string"
  }
}
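
The value returned here is the full ID of the selected checkbox field, so downstream code usually wants just the last segment. Here's a minimal Python sketch of that step. The helper name is hypothetical, and the handling of a missing or null field is an assumption about what you might see when no single checkbox is selected.

```python
from typing import Optional


def selected_loan_type(output: dict) -> Optional[str]:
    """Return the last segment of the selected field ID, or None.

    Assumption: the field may be absent or null when no single
    checkbox is selected, so both cases are treated as "no selection".
    """
    field = output.get("selected_loan_type")
    if not field or not field.get("value"):
        return None
    # "transaction_information.loan_type.conventional" -> "conventional"
    return field["value"].rsplit(".", 1)[-1]


# The output shown above, as a Python dict.
output = {
    "selected_loan_type": {
        "value": "transaction_information.loan_type.conventional",
        "type": "string",
    }
}

print(selected_loan_type(output))  # prints conventional
```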

Extract estimated escrow

See the following screenshot for an overview of how to extract the estimated escrow:

Extract estimated escrow

The query in the left pane in the preceding image looks for the point where a horizontal line bisecting one text phrase crosses a vertical line bisecting another. The Intersection method and the Region method both extract text from a coordinate-defined area, which helps when the document varies too much to rely on methods such as Row.
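
To make the geometry concrete, here's a rough Python sketch of how such an intersection point could be computed from two bounding boxes. This is an illustration only, not Sensible's implementation: the `Box` class, the function name, the coordinates, and the convention that the origin is the top-left of the page (with y increasing downward) are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Box:
    """Bounding box in inches; assumes the origin is the page's top-left."""
    left: float
    top: float
    right: float
    bottom: float


def intersection_point(
    horizontal_anchor: Box,
    vertical_anchor: Box,
    offset_x: float = 0.0,
    offset_y: float = 0.0,
) -> Tuple[float, float]:
    """Cross a vertical line with a horizontal line, then apply offsets."""
    # The vertical line runs through the horizontal center of one anchor.
    x = (vertical_anchor.left + vertical_anchor.right) / 2 + offset_x
    # The horizontal line runs through the vertical center of the other.
    y = (horizontal_anchor.top + horizontal_anchor.bottom) / 2 + offset_y
    return x, y


# Made-up coordinates for a row label and a column header.
row_label = Box(left=0.5, top=6.0, right=1.5, bottom=6.2)
column_header = Box(left=4.0, top=4.0, right=5.0, bottom=4.2)

x, y = intersection_point(row_label, column_header, offset_x=0.1, offset_y=0.05)
print(round(x, 2), round(y, 2))
```

In this sketch, a positive x offset nudges the point to the right and a positive y offset nudges it down the page, which is useful when the target text doesn't sit exactly on the center lines of the anchors.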

To try this out yourself, paste the following query, or "field", into the left pane of the Sensible app:


{
  "fields": [
    {
      "id": "projected_payments.estimated_escrow",
      "type": "currency",
      "method": {
        /* intersection is an alternative to the 
           Row method when table cells
           are unpredictably populated
           target data is at intersection
           of vertical and horizontal lines
           defined by 2 anchors 
            */
        "id": "intersection",
        /* target data is on vertical line
           bisecting "Years 1-30" 
        */
        "verticalAnchor": {
          "match": {
            // match "Years 1-##" or "Years 1 - ##"
            "pattern": "Years 1-\\d{1,2}|Years 1 - \\d{1,2}",
            "type": "regex"
          }
        },
        // offsets the vertical line to the right
        "offsetX": 0.1,
        // offsets the horizontal line downward
        "offsetY": 0.05
      },
      "anchor": {
        /* start looking for anchor match
           after "projected payments" */
        "start": {
          "text": "projected payments",
          "type": "startsWith"
        },
        "match": {
          /* target is on horizontal line
           bisecting "estimated escrow"  */
          "text": "estimated escrow",
          "type": "startsWith"
        }
      }
    }
  ]
}

You'll get this output:


{
  "projected_payments.estimated_escrow": {
    "source": "352.83",
    "value": 352.83,
    "unit": "$",
    "type": "currency"
  }
}
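
The vertical anchor in the query above relies on a regular expression, and it can be handy to check a pattern like that against sample text before you paste it into a config. Here's a small Python sketch that does so. The sample strings are hypothetical, and it assumes the pattern behaves the same in Python's `re` module as it does in Sensible, which holds for this simple pattern but isn't guaranteed in general. Note that the doubled backslashes in the config are JSON string escapes, so the pattern itself contains a single backslash.

```python
import re

# Same pattern as the config, with the JSON escaping removed.
YEARS_PATTERN = re.compile(r"Years 1-\d{1,2}|Years 1 - \d{1,2}")

# Hypothetical column headers to test against.
samples = ["Years 1-30", "Years 1 - 7", "Years 8-30", "Year 1-30"]

for text in samples:
    matched = YEARS_PATTERN.search(text) is not None
    print(f"{text!r}: {matched}")
```

The first two samples match and the last two don't, so a document whose payment columns start at a later year would need a broader pattern.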

Extract transactions table

See the following screenshot for an overview of how to extract a table summarizing the borrower's transactions:

To try this out yourself, paste the following query, or "field", into the left pane of the Sensible app:


{
  "preprocessors": [
    {
      "type": "mergeLines",
      "directlyAdjacentThreshold": 0.16,
      "adjacentThreshold": 0.8,
      "yOverlapThreshold": 0.7
    }
  ],
  "fields": [
    {
      "id": "_summaries_of_transactions_tables.due_from_borrower_at_closing",
      /* target data is a table */
      "type": "table",
      "method": {
        /* of several table methods,
           textTable is fastest */
        "id": "textTable",
        "columns": [
          {
            /* first table column starts 0.5" from left page edge
               and ends 3" from left edge */
            "id": "due_from_borrower_at_closing",
            "minX": 0.5,
            "maxX": 3,
            "type": {
              "id": "custom",
              /* each cell starts w/ 2 digits
              followed by text or digits */
              "pattern": "^\\d{2} [A-Za-z ()0-9]+$"
            },
            /* if cell doesn't start with 2 digits,
               omit its row from output */
            "isRequired": true
          },
          /* 2nd column is 3.28-4.15" from left edge of page,
             cell contents are currency */
          {
            "id": "amount",
            "minX": 3.28,
            "maxX": 4.15,
            "type": "currency"
          }
        ],
        /* (recommended for performance)
        table ends at "adjustments" */
        "stop": {
          "text": "adjustments",
          "type": "equals"
        }
      },
      "anchor": {
        /* table starts after "due from borrower" line
           preceded by "summaries" line */
        "start": [
          {
            "text": "summaries of transactions",
            "type": "equals"
          }
        ],
        "match": {
          "text": "due from borrower at closing",
          "type": "includes"
        }
      }
    },
  ],
  "computed_fields": [
    {
      /* by default, table methods return
      column objects. transform these to
      row objects using the Zip method */
      "id": "summaries_of_transactions_tables.due_from_borrower_at_closing",
      "method": {
        "id": "zip",
        "source_ids": [
          "_summaries_of_transactions_tables.due_from_borrower_at_closing"
        ]
      }
    },
    /* to avoid redundant output, return
       the zipped table row objects and suppress the 
       original column objects */
    {
      "id": "clean_output",
      "method": {
        "id": "suppressOutput",
        "source_ids": [
          "_summaries_of_transactions_tables.due_from_borrower_at_closing"
        ]
      }
    }
  ]
}

You'll get this output:


{
  "summaries_of_transactions_tables.due_from_borrower_at_closing": [
    {
      "due_from_borrower_at_closing": {
        "source": "01 Sale Price of Property",
        "value": "01 Sale Price of Property",
        "type": "custom",
        "customType": "string"
      },
      "amount": {
        "source": "$400,491.00",
        "value": 400491,
        "unit": "$",
        "type": "currency"
      }
    },
    {
      "due_from_borrower_at_closing": {
        "source": "02 Sale Price of Any Personal Property Included in Sale",
        "value": "02 Sale Price of Any Personal Property Included in Sale",
        "type": "custom",
        "customType": "string"
      },
      "amount": null
    },
    {
      "due_from_borrower_at_closing": {
        "source": "03 Closing Costs Paid at Closing (J)",
        "value": "03 Closing Costs Paid at Closing (J)",
        "type": "custom",
        "customType": "string"
      },
      "amount": {
        "source": "$9,039.47",
        "value": 9039.47,
        "unit": "$",
        "type": "currency"
      }
    }
  ]
}
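
Because the zipped output is a list of row objects, it's straightforward to work with in code. As a minimal sketch, the following Python snippet totals the amounts while skipping rows whose amount is null. The function name is hypothetical, and the rows are trimmed to the keys the sketch actually reads.

```python
from decimal import Decimal

# Rows from the output shown above, trimmed to the keys used here.
rows = [
    {
        "due_from_borrower_at_closing": {"value": "01 Sale Price of Property"},
        "amount": {"value": 400491, "unit": "$", "type": "currency"},
    },
    {
        "due_from_borrower_at_closing": {
            "value": "02 Sale Price of Any Personal Property Included in Sale"
        },
        "amount": None,
    },
    {
        "due_from_borrower_at_closing": {
            "value": "03 Closing Costs Paid at Closing (J)"
        },
        "amount": {"value": 9039.47, "unit": "$", "type": "currency"},
    },
]


def total_amount(rows: list) -> Decimal:
    """Sum the currency values, ignoring rows with a null amount."""
    total = Decimal("0")
    for row in rows:
        amount = row.get("amount")
        if amount is None:
            continue  # the cell was empty in the PDF
        # Convert via str() to avoid binary floating-point rounding.
        total += Decimal(str(amount["value"]))
    return total


print(total_amount(rows))  # prints 409530.47
```

Using `Decimal` rather than `float` is a design choice for money: it keeps cents exact when you add many rows together.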

Extract more data

We've covered how to extract a few pieces of data from a closing disclosure. Our prebuilt config extracts much more information. Check it out! In the following screenshot, every blue-outlined line is a piece of extracted data:

Start extracting

Congratulations, you've learned some key methods for extracting structured data from closing disclosure documents. There's more extraction power for you to uncover. Sign up for a free account (150 docs a month, no credit card required), check out our prebuilt closing disclosure config in our open-source library, and peruse our docs to start extracting data from your own documents.