how-to

Extracting data from your personal PDFs: analyzing utility bills

Frances Elliott
Tuesday, January 25, 2022

Extract useful data from personal PDFs, like old bank statements or gas bills, with Sensible

Related Documents

Many of us have useful personal finance data hidden away in PDFs, like old bank statements from closed accounts or utility bills. Getting the information into usable form can be painfully manual. In this post, let’s explore how you can extract nicely structured data from your personal PDFs with a free Sensible account.

For this example, let’s imagine you want to reduce your energy use by analyzing the data in the PDF statements your gas utility company emails to you. With a bit of technical savvy, you can:

  • Spend a few minutes in the Sensible app writing SenseML queries to extract key data from the statements, like the price of gas per therm and number of therms used per month.
  • Trigger the Sensible API to extract the data specified by your queries, when the gas company emails you the PDF statement. 
  • Send the Sensible API responses to CSV, Google Sheets, Airtable, or another destination. For example, each time you receive a statement, you use Sensible to automatically add a row to a custom table, like this:


statement_date total_therms cost_per_therm total_bill
27-Jan-2020 273.928 0.2124 160.56
24-Feb-2020 237.365 0.1446 134.42
25-Mar-2020 151.928 0.08 73.8
26-Jan-2021 250.264 0.3504 172.96
24-Feb-2021 235.341 0.3643 160.31
26-Mar-2021 170.09 0.481 137.93

This post assumes you’re comfortable with scripting or with automation tools like Zapier for tasks like setting up email triggers and sending API results to your preferred destination. So, we’ll focus on the Sensible part of this dataflow:

  • Write SenseML document extraction queries
  • Validate the extraction
  • Call the Sensible API

Write document extraction queries

Let’s say we want to extract data from monthly gas statements that look like this:

gas bill
Gas bill

To extract the data, you can use Sensible’s query language, SenseML. First you first need to decide on the data to extract from the PDFs. To keep it simple, let’s choose:

  • statement date
  • total therms used
  • cost per therm
  • total bill

Use the Sensible app to author and check the output of your extraction queries (“configs”). For example, your finished queries might look like this:

SenseML queries in Sensible app
SenseML queries in Sensible app

If you want to follow along, you can use an example PDF and the following example config to try it out yourself in the Sensible app:

{
  "fields": [
    {
      "id": "statement_date",
      "anchor": "billing date",
      "method": {
        "id": "row"
      }
    },
    {
      "id": "cost_per_therm",
      "type": "number",
      "anchor": {
        "match": [
          {
            "text": "cost of gas is estimated to be",
            "type": "includes"
          }
        ]
      },
      "method": {
        "id": "passthrough"
      }
    },
    {
      "id": "total_bill",
      "type": "currency",
      "anchor": {
        "match": {
          "text": "total current gas charges",
          "type": "startsWith"
        }
      },
      "method": {
        "id": "row"
      }
    },
    {
      "id": "total_therms",
      "type": "number",
      "anchor": {
        "match": [
          {
            "type": "startsWith",
            "text": "conversion"
          },
          {
            "type": "startsWith",
            "text": "therms"
          }
        ]
      },
      "method": {
        "id": "label",
        "position": "below"
      }
    }
  ]
}

Validate the data

To make sure you extract reasonable values, let’s write some validations. For example, let’s say the cost of gas per therms has historically been under $1. We can trigger a warning if that extracted value is over $1 using JsonLogic


{"<":[{"var":"cost_per_therm.value"},1]}


Let’s specify the warning in the Sensible app:

Create a validation

Call the API

Now you’ve got your queries and validations set up, you can start extracting from your PDF statements using the Sensible API. For example, if you assume:

  • the config example from the previous section is in a document type called "utility_bills"
  • you have a PDF named "utility_statement_gas_dec_2019"

Then you could use curl to call the API:


curl --request POST \
  --url "https://api.sensible.so/v0/extract/utility_bills" \
  --header "Authorization: Bearer YOUR_API_TOKEN" \
  --header "Content-Type: application/pdf" \
  --data-binary "@utility_statement_gas_dec_2019.pdf" \

And get a response like:


{
  "configuration": "gas_bill",
  "created": "2022-01-19T19:07:27.729Z",
  "errors": [],
  "id": "cddb1eb1-c743-45c9-9d0c-b2222cfe1d42",
  "parsed_document": {
    "cost_per_therm": {
      "source": ".2436",
      "type": "number",
      "value": 0.2436
    },
    "statement_date": {
      "type": "string",
      "value": "26-DEC-2019"
    },
    "total_bill": {
      "source": "$134.79",
      "type": "currency",
      "unit": "$",
      "value": 134.79
    },
    "total_therms": {
      "source": "230.854",
      "type": "number",
      "value": 230.854
    }
  },
  "status": "COMPLETE",
  "type": "utility_bills",
  "validation_summary": {
    "errors": 0,
    "fields": 4,
    "fields_present": 4,
    "skipped": 0,
    "warnings": 0
  },
  "validations": []
}

Try it for free

To get data out of your personal PDFs, sign up for Sensible (free extractions for up to 150 documents a month).