How to extract data from CMS-1500 forms with Sensible

Updated on

May 27, 2025


Author

Frances Elliott


In the healthcare industry, the CMS-1500 form (formerly known as the HCFA-1500) is a standardized paper claim form used by healthcare providers to bill Medicare, Medicaid, and most other insurance carriers. For companies in healthcare tech, automatically extracting data from these forms is critical for streamlining claims processing, reducing manual entry errors, and accelerating reimbursement timelines.

Enter Sensible. With Sensible, you can easily parse key information from CMS-1500 forms using SenseML, Sensible's query language for extracting data from documents. We've written a library of open-source SenseML configurations, so you don't need to write queries from scratch for common documents. From there, your extracted healthcare data is accessible via API, Sensible's UI, or thousands of other software integrations through Zapier.

Note that while Sensible offers powerful LLM-based SenseML methods to parse these documents, CMS-1500 forms have a standardized layout that makes them excellent candidates for our layout-based methods. These methods are not only fast but also extremely accurate for forms with consistent structures. So, this tutorial will focus on layout-based methods to extract from CMS-1500 forms.


What we'll cover


This blog post walks you through extracting specific pieces of information from an example CMS-1500 form.

By the end, you'll know several SenseML methods and you'll be on your way to extracting any data you choose using our documentation or our prebuilt open-source configurations.


Here’s the example document we’ll use with dummy patient data:


To follow along, you can sign up for a Sensible account, then import an example CMS-1500 PDF and prebuilt open-source configurations directly to the Sensible app using the docs for Out-of-the-box extractions.

Our configurations for CMS-1500 extractions are comprehensive. To keep the example in this post simple, let's extract just the:

carrier name
patient name
patient’s marital status
lines of service

Identify form revision with fingerprints


First, let's identify the revision number (08-05) for the CMS-1500 form in order to optimize the extraction process. We’ll use fingerprints to do so. (Note that classifying the form generally as a CMS-1500 happens upstream and isn’t covered in this tutorial.)


{
  "fingerprint": {
    "tests": [
      {
        "page": "every",
        "match": [
          {
            "text": "FORM CMS-1500 (08-05)",
            "type": "endsWith",
            "isCaseSensitive": true
          }
        ]
      }
    ]
  }
}


This fingerprint tests the CMS-1500 revision by checking that every page contains the text "FORM CMS-1500 (08-05)" at the end of a line. If this test passes, Sensible will use a specific set of queries to extract data from the document. This approach helps Sensible quickly determine the appropriate queries before attempting to extract data from it.
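To make the logic of this test concrete, here is a minimal Python sketch of what an "every page, ends with" check amounts to. It is an illustration only, not Sensible's implementation: the function names and the sample page lines are made up for this example.

```python
from typing import List

FINGERPRINT_TEXT = "FORM CMS-1500 (08-05)"


def page_passes(lines: List[str], text: str = FINGERPRINT_TEXT) -> bool:
    """Return True if at least one line on the page ends with the text.

    str.endswith is case sensitive, mirroring "isCaseSensitive": true.
    """
    return any(line.rstrip().endswith(text) for line in lines)


def fingerprint_matches(pages: List[List[str]]) -> bool:
    """Return True only if every page passes, mirroring "page": "every"."""
    return bool(pages) and all(page_passes(lines) for lines in pages)


# Hypothetical lines of text for a one-page document
pages = [
    ["HEALTH INSURANCE CLAIM FORM", "APPROVED FORM CMS-1500 (08-05)"],
]
print(fingerprint_matches(pages))
```

A document whose footer reads differently, for example a later revision of the form, fails this check, so a configuration written for the 08-05 layout isn't applied to it.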


Extract the carrier information


Let's extract the carrier information at the top of the form, to the left of the vertical "CARRIER" label in the page margin:

Here are the queries we’ll use:


{
  /* Sensible supports inline comments using JSON5 */
  
  "id": "carrier", // user-friendly ID for the extracted data
  "anchor": {
    "match": {
      /* the target text is near the word "carrier"
      (in this case, vertical text on right page margin) */
      "text": "CARRIER",
      "type": "includes",
      "isCaseSensitive": true
    }
  },
  "method": {
    /* define a rectangular region relative to the anchor
       and extract the text in it */
    "id": "region",
    /* start at the left side of the anchor line */
    "start": "left",
    /* region is 2.5 inches wide */
    "width": 2.5,
    /* region is 1 inch tall */
    "height": 1,
    /* offset 3 inches to the left of the anchor */
    "offsetX": -3,
    /* offset 0.6 inches up from the anchor */
    "offsetY": -0.6,
    /* ensure lines in the region are sorted left-to-right */
    "sortLines": "readingOrderLeftToRight"
  }
}


This field uses the Region method to extract carrier information. We anchor on the word "CARRIER" and define a rectangular region (2.5 × 1 inches) that's positioned 3 inches to the left and 0.6 inches up from the anchor (displayed as a green rectangular overlay in the preceding screenshot). The Region method extracts all text within this defined area. The sortLines parameter ensures that if there are multiple lines of text in this region, they're read in the correct order.
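The geometry is easier to see with numbers. The following Python sketch computes a region rectangle from an anchor's bounding box and the offsets in the config. It rests on assumptions that aren't stated in this post: coordinates are in inches with the origin at the page's top-left corner, the starting point is the midpoint of the anchor's left edge, and the offsets position the region's top-left corner. The Box class, the function name, and the anchor coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float    # inches from the page's left edge
    top: float     # inches from the page's top edge
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        """Return True if the point (x, y) falls inside the box."""
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


def region_from_anchor(anchor: Box, width: float, height: float,
                       offset_x: float, offset_y: float) -> Box:
    """Build a region starting from the midpoint of the anchor's left edge.

    Assumption: the offsets move the region's top-left corner away
    from that starting point.
    """
    start_x = anchor.left
    start_y = anchor.top + anchor.height / 2
    return Box(start_x + offset_x, start_y + offset_y, width, height)


# Hypothetical position of the vertical "CARRIER" text in the right margin
anchor = Box(left=8.0, top=0.5, width=0.2, height=1.0)
region = region_from_anchor(anchor, width=2.5, height=1,
                            offset_x=-3, offset_y=-0.6)
print(region)
```

With these made-up coordinates, the region spans from 5 to 7.5 inches across the page. Any line of text whose position falls inside that rectangle would be captured.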

Extracted value:


"carrier": {
  "type": "string",
  "value": "Test Company"
}


Extract the patient's name


Now let's extract the patient's name from the form:



{
  "id": "item2.patients_name",
  "anchor": {
    "match": {
      /* target data is near text 'patient's name' */
      "text": "patient's name",
      "type": "includes",
      /* allow for OCR errors with editDistance */
      "editDistance": 1
    }
  },
  "method": {
    /* extract the line below the anchoring text */
    "id": "label",
    "position": "below"
  }
}


This field uses the Label method to extract the patient's name. We anchor on the text "patient's name" and specify that we want to extract the text directly below this anchor. The editDistance parameter allows for minor OCR errors in the anchor text, making the extraction more robust with scanned documents.
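As a rough illustration of why an edit distance of 1 helps with scans, here is a Python sketch of a fuzzy "includes" match based on Levenshtein distance. It is not how Sensible implements editDistance. The function names and sample lines are hypothetical, and the sketch assumes case-insensitive matching.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_includes(line: str, text: str, max_distance: int = 1) -> bool:
    """Return True if some substring of line is within max_distance edits of text."""
    line, text = line.lower(), text.lower()
    n = len(text)
    for size in range(max(1, n - max_distance), n + max_distance + 1):
        for start in range(len(line) - size + 1):
            if levenshtein(line[start:start + size], text) <= max_distance:
                return True
    return False


# Hypothetical OCR output in which the "I" was misread as "L"
print(fuzzy_includes("2. PATLENT'S NAME (Last Name, First Name)", "patient's name"))
print(fuzzy_includes("5. PATIENT'S ADDRESS (No., Street)", "patient's name"))
```

The first line differs from the anchor text by one substituted character, so it still matches. The second line is a different label and doesn't.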

The Label method is well suited to the CMS-1500 form, because much of the data is structured in a label-value format, where a label (like "patient's name") appears near the actual data we want to extract.

Extracted value:


"item2.patients_name": {
  "type": "string",
  "value": "Smith, John"
}


Extract patient's marital status


To determine the patient's marital status, we’ll check if the "Single" checkbox is selected:



{
  "id": "item8.patient_status.single",
  "anchor": {
    "match": {
      "text": "single",
      "type": "startsWith"
    }
  },
  "method": {
    /* find the checkbox closest to the text "single" 
       and return its selection status as a boolean */
    "id": "nearestCheckbox",
    /* search to the right of the anchor text */
    "position": "right"
  }
}


This field uses the Nearest Checkbox method to determine if the "Single" checkbox is selected. We anchor on the text "single" and search for the nearest checkbox to the right of this text.

The Nearest Checkbox method can handle a wide variety of checkbox formats. It uses either the document's own metadata about form fields (if available) or falls back to advanced OCR to detect checkbox selections.
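Once checkboxes have been detected, the "nearest to the right" selection step can be pictured with a small Python sketch. This is only a geometric illustration under assumed coordinates (inches from the page's top-left corner). It says nothing about how Sensible detects checkboxes, and the class, function, and positions are hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional


@dataclass(frozen=True)
class Checkbox:
    x: float        # center, inches from the page's left edge
    y: float        # center, inches from the page's top edge
    checked: bool


def nearest_checkbox_right(anchor_x: float, anchor_y: float,
                           boxes: List[Checkbox]) -> Optional[Checkbox]:
    """Return the closest checkbox whose center lies to the right of the anchor."""
    candidates = [box for box in boxes if box.x > anchor_x]
    if not candidates:
        return None
    return min(candidates,
               key=lambda box: hypot(box.x - anchor_x, box.y - anchor_y))


# Hypothetical layout: the anchor point is the right edge of the text "Single"
boxes = [
    Checkbox(x=3.5, y=2.0, checked=False),  # left of the anchor, ignored
    Checkbox(x=4.0, y=2.0, checked=True),   # the box next to "Single"
    Checkbox(x=4.7, y=2.0, checked=False),  # a box farther to the right
]
nearest = nearest_checkbox_right(3.9, 2.0, boxes)
print(nearest.checked if nearest else None)
```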

Extracted value:



"item8.patient_status.single": {
  "type": "boolean",
  "value": true
}


Extract service line items


CMS-1500 forms can have multiple service line items in field 24:


We can extract these using Sections:




{
      "id": "item24",
      /* each row containing a service line item is a section */
      "type": "sections",
      "range": {
        "anchor": {
          "start": [
            {
              /* the sections group starts after a line that starts with "24."
                 followed by a line that includes "MODIFIER";
                 all text that precedes these lines is ignored */
              "type": "startsWith",
              "text": "24."
            },
            {
              "type": "includes",
              "text": "MODIFIER",
              "isCaseSensitive": true
            }
          ],
          /* each section starts with a date (two digits, space, two digits)  */
          "match": {
            "type": "regex",
            "pattern": "^\\d{2} \\d{2}"
          },
          /* end looking for sections before form's item 25 */
          "end": {
            "text": "25.",
            "type": "startsWith"
          }
        },
        /* each section stops before "25".
           Optional param; prevents last section in group
           from extending to end of document */
        "stop": {
          "text": "25.",
          "type": "startsWith"
        }
      },
      "fields": [
        {
          /* extract each section's 'from' service date
             using the Region method */
          "id": "a.dates_of_service.from",
          "type": {
            "id": "date",
            "format": [
              "%M %D %y",
              "%M1 %D %y",
              "%M 1 %D %y"
            ]
          },
          "anchor": {
            "match": {
              "type": "regex",
              "pattern": "^\\d{2} \\d{2}"
            }
          },
          "method": {
            "id": "region",
            "start": "left",
            "width": 0.8,
            "height": 0.25,
            "offsetX": -0.01,
            "offsetY": -0.15,
            "sortLines": "readingOrderLeftToRight"
          }
        },
        /* extract each section's place of service
           using the Region method */
        {
          "id": "b.place_of_service",
          "anchor": {
            "match": {
              "type": "regex",
              "pattern": "^\\d{2} \\d{2}"
            },
            "end": [
              {
                "type": "regex",
                "pattern": "^\\d{2} \\d{2}"
              },
              {
                "type": "regex",
                "pattern": "^\\d{2} \\d{2}"
              }
            ]
          },
          "method": {
            "id": "region",
            "start": "left",
            "width": 0.41,
            "height": 0.25,
            "offsetX": 1.7,
            "offsetY": -0.15,
            "sortLines": "readingOrderLeftToRight"
          }
        },
        /* Additional fields for dates_of_service.to,
       procedures, charges, etc. */
      ]
    }


This field uses the Sections method to extract multiple service line items from field 24. For each service line, you can define multiple subfields that extract specific pieces of information like service dates, place of service, procedure codes, and charges.

The Sections method is powerful for handling repeating data structures. It allows us to define a range where these sections appear and then extract consistent data from each section.
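The range logic in the config above (start after the item 24 header, begin a new section at each date-like line, stop at item 25) can be sketched in plain Python. This is a simplified illustration that works on an ordered list of text lines. It is not Sensible's implementation, and the function name and sample lines are made up.

```python
import re
from typing import List

# Same pattern as the "match" in the config: two digits, a space, two digits
SECTION_START = re.compile(r"^\d{2} \d{2}")


def split_sections(lines: List[str]) -> List[List[str]]:
    """Group lines into sections between the item 24 header and item 25."""
    seen_24 = False
    in_range = False
    sections: List[List[str]] = []
    for line in lines:
        if not in_range:
            # Range start: a line starting with "24.", then a later
            # line that includes "MODIFIER"
            if not seen_24:
                seen_24 = line.startswith("24.")
            elif "MODIFIER" in line:
                in_range = True
            continue
        if line.startswith("25."):
            break  # range end: stop looking for sections
        if SECTION_START.match(line):
            sections.append([line])      # a date-like line opens a section
        elif sections:
            sections[-1].append(line)    # other lines belong to the open section
    return sections


# Hypothetical lines of text in reading order
lines = [
    "21. DIAGNOSIS OR NATURE OF ILLNESS OR INJURY",
    "24. A. DATE(S) OF SERVICE",
    "MODIFIER",
    "06 10 22 06 10 22 2",
    "06 10 22 06 10 22 11",
    "25. FEDERAL TAX I.D. NUMBER",
    "07 01 22 text after item 25 is ignored",
]
for section in split_sections(lines):
    print(section)
```

Each returned list corresponds to one service line. In the real configuration, the fields inside the section then run against only that section's lines.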

Extracted data:




"item24": [
    {
      "a.dates_of_service.from": {
        "source": "06 10 22",
        "value": "2022-06-10T00:00:00.000Z",
        "type": "date"
      },
      "b.place_of_service": {
        "type": "string",
        "value": "2"
      },
      /* more fields extracted from section here */
    },
    {
      "a.dates_of_service.from": {
        "source": "06 10 22",
        "value": "2022-06-10T00:00:00.000Z",
        "type": "date"
      },
      "b.place_of_service": {
        "type": "string",
        "value": "11"
      },
      /* more fields extracted from section here */
    }
]


Putting it all together

When you run this configuration against a CMS-1500 form, Sensible extracts all the defined fields and returns them in a structured JSON format that's ready to be integrated with your systems.

Sample output for the extracted fields covered in this tutorial:




{
  "carrier": {
    "type": "string",
    "value": "Test Company"
  },
  "item2.patients_name": {
    "type": "string",
    "value": "Smith, John"
  },
  "item8.patient_status.single": {
    "type": "boolean",
    "value": true
  },
  "item24": [
    {
      "a.dates_of_service.from": {
        "source": "06 10 22",
        "value": "2022-06-10T00:00:00.000Z",
        "type": "date"
      },
      "b.place_of_service": {
        "type": "string",
        "value": "2"
      },
      /* more fields extracted from section here */
    },
    {
      "a.dates_of_service.from": {
        "source": "06 10 22",
        "value": "2022-06-10T00:00:00.000Z",
        "type": "date"
      },
      "b.place_of_service": {
        "type": "string",
        "value": "11"
      },
      /* more fields extracted from section here */
    }
  ]
}
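As one example of integrating this output, the following Python sketch flattens the sample into one row per service line, the shape a claims table might expect. It assumes every field was extracted successfully, so a real integration would also need to handle missing values. The function name and row keys are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List

# The sample output from this post, without the elided fields
parsed_document: Dict[str, Any] = {
    "carrier": {"type": "string", "value": "Test Company"},
    "item2.patients_name": {"type": "string", "value": "Smith, John"},
    "item8.patient_status.single": {"type": "boolean", "value": True},
    "item24": [
        {
            "a.dates_of_service.from": {
                "source": "06 10 22",
                "value": "2022-06-10T00:00:00.000Z",
                "type": "date",
            },
            "b.place_of_service": {"type": "string", "value": "2"},
        },
        {
            "a.dates_of_service.from": {
                "source": "06 10 22",
                "value": "2022-06-10T00:00:00.000Z",
                "type": "date",
            },
            "b.place_of_service": {"type": "string", "value": "11"},
        },
    ],
}


def flatten_claim(parsed: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one row per service line, repeating the claim-level fields."""
    header = {
        "carrier": parsed["carrier"]["value"],
        "patient_name": parsed["item2.patients_name"]["value"],
        "patient_is_single": parsed["item8.patient_status.single"]["value"],
    }
    rows = []
    for line in parsed.get("item24", []):
        iso = line["a.dates_of_service.from"]["value"]
        service_from = datetime.strptime(iso, "%Y-%m-%dT%H:%M:%S.%fZ").date()
        rows.append({
            **header,
            "service_from": service_from,
            "place_of_service": line["b.place_of_service"]["value"],
        })
    return rows


for row in flatten_claim(parsed_document):
    print(row)
```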

Extract more data

We've covered how to extract a few key pieces of data from CMS-1500 forms. Our prebuilt configuration extracts much more information, including insurance details, diagnosis codes, referring provider information, and billing provider details. That full extraction coverage enables use cases such as:

Automated claims processing
Real-time eligibility verification
Integration with EHR systems
Compliance and audit preparation

Start extracting

Congratulations, you've learned some key methods for extracting structured data from CMS-1500 forms! There's more extraction power for you to uncover. Book a demo or check out our managed services for customized implementation support. Or explore on your own: sign up for an account, check out our prebuilt healthcare templates in our open-source library, and peruse our docs to start extracting data from your own documents.


