how-to

How to extract data from Explanation of Benefits documents

Frances Elliott
Wednesday, November 9, 2022

Use SenseML to extract structured data from a Medicaid explanation of benefits PDF.

Related Documents

If you’re building software for healthcare providers, chances are that you’ve come across the explanation of benefits (EOB) document. 

EOBs are generated by insurance companies to describe what costs the patient’s insurance policy will cover for the medical care that the patient received. This document is created whenever a healthcare provider submits a claim to the patient’s insurance company.

If a healthcare provider isn’t set up to receive an electronic remittance advice (ERA) with an insurance company, that insurer will instead send paper EOBs. Manually recording and reconciling an EOB is a slow, time-consuming process which then delays downstream workflows – like how quickly the healthcare provider gets paid. 

By automating pulling data out of EOBs, you can upload information into billing systems sooner and identify exceptions faster. Digitizing the information on EOBs also means that you can query and search records in the future.

With Sensible you can easily extract key information out of EOB PDFs using SenseML, Sensible’s query language for extracting data from documents. We’ve written a library of open-source SenseML configurations, so you don’t need to write queries from scratch for common documents. From there, your EOB data is accessible via API, Sensible’s UI, or 5,000 other software integrations thanks to Zapier.

What we'll cover

This blog post briefly walks you through configuring extractions for Medicaid EOBs. By the end, you’ll know a couple of SenseML methods and you’ll be on your way to extracting any data you choose using our documentation or our prebuilt open-source EOB configurations, which currently support Medicaid, Tricare, and Cigna EOBs.

Write document extraction queries with SenseML

Let's walk through extracting specific pieces of data from an explanation of benefits. Here's an example of an EOB PDF with redacted or dummy data:

Medicaid explanation of benefits

To follow along, you can sign up for a Sensible account, then download an example PDF and upload it to the Sensible app, or import the PDF and prebuilt open-source closing disclosure configurations directly to the Sensible app.

Our configurations for explanation of benefit extractions are comprehensive. To keep the example in this post simple, let's extract just the:

  • Remittance date
  • Total allowed charges for each claim
  • Service details such as dates and codes

Extract remittance date

See the following screenshot for an overview of how to extract the remittance date.

Extract date (left pane: query. middle pane: document. right pane: output)

The query in the left pane in the preceding image treats the string "DATE" as the first cell in a row, and searches to the right of it for the date. The PDF is displayed in the middle pane, and the extracted date (2022-01-13) is in the right pane.

To try this out yourself, paste the following query, or "field" into the left pane of the Sensible app.


{
  /* SenseML uses JSON5 to support code comments */
  "preprocessors": [
    {
      /* correct oversplit lines
      see https://docs.sensible.so/docs/merge-lines */
      "type": "splitLines",
      "minSpaces": 2
    }
  ],
  "fields": [
    {
      /* ID for target data */
      "id": "date_of_remittance",
      /* target data is a date, else return null */
      "type": "date",
      "anchor": {
        "match": {
          /* search for target data 
           near text "DATE" in doc*/
          "text": "DATE",
          "type": "includes",
          "isCaseSensitive": true
        }
      },
      "method": {
        /* target to extract is in a row
           see https://docs.sensible.so/docs/row */
        "id": "row",
        /* target is to right of anchor 
        ("date issued") in row */
        "position": "right",
        /* grab 1st row cell (right of anchor) */
        "tiebreaker": "first"
      }
    },
  ]
}

You'll get this output:


{
  "date_of_remittance": {
    "source": "01/13/2022",
    "value": "2022-01-13T00:00:00.000Z",
    "type": "date"
  }
}

Extract an allowed covered charge

See the following screenshot for an overview of how to extract the first total allowed covered charge that occurs in the document:

Extract first total

The query in the left pane in the preceding image extracts all the text in a rectangular region with coordinates defined relative to the anchoring text. This method, as well as the Intersection method, are ways to extract text in a coordinate-defined area when the document has too much variability to rely on methods such as Row.

To try this out yourself, paste the following query into the left pane of the Sensible app:


{
  /* SenseML uses JSON5 to support code comments */
  "preprocessors": [
    {
      /* correct oversplit lines
      see https://docs.sensible.so/docs/merge-lines */
      "type": "splitLines",
      "minSpaces": 2
    }
  ],
  "fields": [
    {
      "id": "totals.allowed_covered_charges",
      /* target data is currency */
      "type": "currency",
      /* use first anchor match in doc */
      "match": "first",
      "anchor": {
        "match": {
          /* target is near **CLAIM TOTAL** 
             anchor text */
          "type": "includes",
          "text": "**CLAIM TOTAL**",
          "isCaseSensitive": true
        }
      },
      "method": {
        /* target data is in rectangular region
           with coordinates relative to anchor.
           To display coordinates, 
           drag mouse in Sensible app */
        "id": "region",
        /* use Region as alternative to Row 
           to handle unpredictably empty cells */
        "start": "left",
        "width": 0.49,
        "height": 0.17,
        "offsetX": 3.2,
        "offsetY": -0.07
      }
    }
  ]
}

You'll get this output:


{
  "totals.allowed_covered_charges": {
    "source": "125.19",
    "value": 125.19,
    "unit": "$",
    "type": "currency"
  }
}

Extract each claim’s details

You just saw how to extract a single total charge. In real life, you’ll want to extract each total charge for each claim. See the following screenshot for an overview of how to extract an array of claims with charges, instead of a single charge:

The query in the left pane in the preceding image slices the document into claims sections, each of which starts with a 7-digit claim number and ends with “**CLAIM TOTAL**”.  Each claim section is its own mini document with its own set of fields.  For example, you just learned about using the Region method to get the first total allowed covered charge. When you use the Region method in a section, you get back the charge for each claim.

To try it out, paste the following query into the left pane of the Sensible app:


{
  /* SenseML uses JSON5 to support code comments */
  "preprocessors": [
    {
      /* correct oversplit lines
      see https://docs.sensible.so/docs/merge-lines */
      "type": "splitLines",
      "minSpaces": 2
    }
  ],
  "fields": [],
  "sections": [
    {
      /* Get array of claims, each containing info
         like total charges, dates, and services.
         Each claim is a 'section', or mini-document,
         isolated from the rest of the doc.
         For more info see https://docs.sensible.so/docs/sections */
      "id": "claims_details_sections",
      /* Display section boundaries in the Sensible app 
         as brackets */
      "display": true,
      "range": {
        "anchor": {
          "start": {
            /* start looking for claims after this text
            ("PAID CLAIMS") (near start of doc)" */
            "text": "PAID CLAIMS",
            "type": "startsWith",
            "isCaseSensitive": true
          },
          /* each claim starts with 7-digit claim number */
          "match": {
            "type": "regex",
            "pattern": "^\\d{7}$"
          },
          "end": {
            /* stop looking for claims
               above this text (near end of doc) */
            "text": "paid claim totals",
            "type": "includes"
          }
        },
        /* each claim ends with this text (***CLAIM TOTAL**) */
        "stop": {
          "text": "**CLAIM TOTAL**",
          "type": "includes",
          "isCaseSensitive": true
        },
        "requireStop": true,
      },
      "fields": [
        {
          /* extract patient name for each claim */
          "id": "patient_name",
          "anchor": {
            "match": {
              /* patient name is near a 7-digit number (claim #) */
              "type": "regex",
              "pattern": "^\\d{7}$"
            }
          },
          "method": {
            /* name is in a row, 1st cell to right of claim # */
            "id": "row",
            "position": "right",
            "tiebreaker": "first"
          }
        },
        /* extract total allowed covered charges for each claim */
        {
          "id": "totals.allowed_covered_charges",
          "type": "accountingCurrency",
          "anchor": {
            "match": {
              "type": "includes",
              "text": "claim total"
            }
          },
          "method": {
            "id": "region",
            "start": "left",
            "width": 0.49,
            "height": 0.13,
            "offsetX": 3.2,
            "offsetY": -0.06
          }
        }
      ]
    }
  ]
}

You'll get this output:


{
  "claims_details_sections": [
    {
      "patient_name": {
        "value": "SMITH, JOHN",
        "type": "string"
      },
      "totals.allowed_covered_charges": {
        "source": "295.65",
        "value": 295.65,
        "unit": "$",
        "type": "accountingCurrency"
      }
    },
    {
      "patient_name": {
        "value": "SMITH, JANE",
        "type": "string"
      },
      "totals.allowed_covered_charges": {
        "source": "29.00",
        "value": 29,
        "unit": "$",
        "type": "accountingCurrency"
      }
    },
    {
      "patient_name": {
        "value": "SMITH, JOE",
        "type": "string"
      },
      "totals.allowed_covered_charges": {
        "source": "98.55",
        "value": 98.55,
        "unit": "$",
        "type": "accountingCurrency"
      }
    }
  ]
}

Restructure extracted data 

Each claim can contain several columns containing services with dates and number of procedures. In this configuration, Sensible extracts these columns as parallel arrays. See the following screenshot for an overview: 

Configuration extracts columns by default

Since each row represents a service, rows are more convenient than columns in this case. For an overview of how to zip the “column” arrays into “rows” of corresponding data for each service, see the following screenshot:

Zipped service details

To try this out yourself, paste the following query into the left pane of the Sensible app:


{
  "preprocessors": [
    {
      /* correct oversplit lines
      see https://docs.sensible.so/docs/merge-lines */
      "type": "splitLines",
      "minSpaces": 2
    }
  ],
  "fields": [],
  "sections": [
    {
      "id": "claims_details",
      "range": {
        "anchor": {
          "start": {
            "text": "paid claims",
            "type": "startsWith"
          },
          "match": {
            "type": "regex",
            "pattern": "^\\d{7}$"
          },
          "end": {
            "text": "paid claim totals",
            "type": "includes"
          }
        },
        "stop": {
          "text": "claim total",
          "type": "includes"
        },
        "requireStop": true
      },
      "fields": [
        {
          "id": "patient_name",
          "anchor": {
            "match": {
              "type": "regex",
              "pattern": "^\\d{7}$"
            }
          },
          "method": {
            "id": "row",
            "position": "right",
            "tiebreaker": "first"
          }
        },
        /* extract start date of service */
        {
          "id": "dates_of_service.begin",
          /* match all to get an array of 
          begin dates in this claim */
          "match": "all",
          "type": {
            "id": "date",
            /* target is a custom 8-digit 
               date format (05252022) */
            "format": "%M%D%Y"
          },
          "anchor": {
            "match": {
              "type": "regex",
              /* anchor on 2 consecutive 
              8-digit numbers (dates) */
              "pattern": "\\d{8} \\d{8}"
            }
          },
          "method": {
            "id": "regex",
            /* extract the 1st of 2 consecutive dates */
            "pattern": "(\\d{8}) \\d{8}"
          }
        },
        {
          /* extract array of end dates of service */
          "id": "dates_of_service.end",
          "match": "all",
          "type": {
            "id": "date",
            "format": "%M%D%Y"
          },
          "anchor": {
            "match": {
              "type": "regex",
              "pattern": "\\d{8} \\d{8}"
            }
          },
          "method": {
            "id": "regex",
            /* extract the 2nd of 2 consecutive dates */
            "pattern": "\\d{8} (\\d{8})"
          }
        },
        /* extract array of procedure codes */
        {
          "id": "procedure",
          "match": "all",
          "anchor": {
            "match": {
              /* procedure is near 2 8-digit numbers (dates) */
              "type": "regex",
              "pattern": "\\d{8} \\d{8}"
            }
          },
          "method": {
            /* procedure is in a row, 2nd cell to 
               right of dates */
            "id": "row",
            "position": "right",
            "tiebreaker": "second"
          }
        },
        {
          /* extract array of number of services */
          "id": "number_of_services",
          "match": "all",
          "anchor": {
            "match": {
              "type": "regex",
              /* services # is near 2 
                 8-digit numbers (dates) */
              "pattern": "\\d{8} \\d{8}"
            }
          },
          "method": {
            /* services # is in a row, 1st cell 
               to right of dates */
            "id": "row",
            "position": "right",
            "tiebreaker": "first"
          }
        }
      ],
      "computed_fields": [
        {
          /* zip following fields from parallel 
             "column" arrays
            into "row" objects for each service */
          "id": "services_zipped",
          "method": {
            "id": "zip",
            "source_ids": [
              "dates_of_service.begin",
              "dates_of_service.end",
              "procedure",
              "number_of_services"
            ]
          }
        },
        {
          "id": "clean_output",
          "method": {
            /* for cleaner output, 
               don't output the 
              source fields used in zip */
            "id": "suppressOutput",
            "source_ids": [
              "dates_of_service.begin",
              "dates_of_service.end",
              "procedure",
              "number_of_services"
            ]
          }
        }
      ]
    }
  ]
}

You'll get this output:


{
  "claims_details": [
    {
      "patient_name": {
        "value": "SMITH, JOHN",
        "type": "string"
      },
      "services_zipped": [
        {
          "dates_of_service.begin": {
            "source": "05232022",
            "value": "2022-05-23T00:00:00.000Z",
            "type": "date"
          },
          "dates_of_service.end": {
            "source": "05232022",
            "value": "2022-05-23T00:00:00.000Z",
            "type": "date"
          },
          "procedure": {
            "value": "S9480",
            "type": "string"
          },
          "number_of_services": {
            "value": "1.000",
            "type": "string"
          }
        },
        {
          "dates_of_service.begin": {
            "source": "05252022",
            "value": "2022-05-25T00:00:00.000Z",
            "type": "date"
          },
          "dates_of_service.end": {
            "source": "05252022",
            "value": "2022-05-25T00:00:00.000Z",
            "type": "date"
          },
          "procedure": {
            "value": "S9480",
            "type": "string"
          },
          "number_of_services": {
            "value": "1.000",
            "type": "string"
          }
        },
        {
          "dates_of_service.begin": {
            "source": "05262022",
            "value": "2022-05-26T00:00:00.000Z",
            "type": "date"
          },
          "dates_of_service.end": {
            "source": "05262022",
            "value": "2022-05-26T00:00:00.000Z",
            "type": "date"
          },
          "procedure": {
            "value": "S9480",
            "type": "string"
          },
          "number_of_services": {
            "value": "1.000",
            "type": "string"
          }
        }
      ]
    },
    {
      "patient_name": {
        "value": "SMITH, JANE",
        "type": "string"
      },
      "services_zipped": [
        {
          "dates_of_service.begin": {
            "source": "05252022",
            "value": "2022-05-25T00:00:00.000Z",
            "type": "date"
          },
          "dates_of_service.end": {
            "source": "05252022",
            "value": "2022-05-25T00:00:00.000Z",
            "type": "date"
          },
          "procedure": {
            "value": "H2019 HA",
            "type": "string"
          },
          "number_of_services": {
            "value": "4.000",
            "type": "string"
          }
        }
      ]
    },
    {
      "patient_name": {
        "value": "SMITH, JOE",
        "type": "string"
      },
      "services_zipped": [
        {
          "dates_of_service.begin": {
            "source": "05282022",
            "value": "2022-05-28T00:00:00.000Z",
            "type": "date"
          },
          "dates_of_service.end": {
            "source": "05282022",
            "value": "2022-05-28T00:00:00.000Z",
            "type": "date"
          },
          "procedure": {
            "value": "S9480",
            "type": "string"
          },
          "number_of_services": {
            "value": "1.000",
            "type": "string"
          }
        }
      ]
    }
  ]
}

Extract more data

We've covered how to extract a few pieces of data from an explanation of benefits (EOB). Our prebuilt config extracts much more information. Check it out! In the following screenshot, every blue-outlined line is a piece of extracted data:

Full EOB extraction

Start extracting

Congratulations, you've learned some key methods for extracting structured data from EOB documents. There's more extraction power for you to uncover. Sign up for a free account (150 docs a month, no credit card required), check out our prebuilt EOB config in our open-source library, and peruse our docs to start extracting data from your own documents.