How to extract data from employment verification forms with Sensible

Updated on June 19, 2025

Author: Frances Elliott

Employment verification forms (VOEs) serve as essential documents in financial services, providing lenders with crucial information about applicants' income history, employment status, and financial stability. Whether you're processing mortgage applications, personal loans, or credit checks, automating data extraction from these forms can significantly streamline your underwriting process and reduce manual data entry errors.

Each verification provider has its own document format, which presents an interesting challenge for document automation. Enter Sensible, which allows you to handle these variations using SenseML, Sensible's query language for extracting data from documents. We've written a library of open-source SenseML configurations, so you don't need to write queries from scratch for common documents. From there, your extracted employment verification data is accessible via API, Sensible's UI, or thousands of other software integrations through Zapier.

Note that Sensible offers powerful AI-based methods to parse these documents. For example, we offer tutorials on extracting from rent rolls and resumes using LLMs. In contrast to such free-form documents, employment verification forms have consistent layouts that make them excellent candidates for our layout-based methods. These methods are not only fast but also extremely accurate for forms with structured formats. So, this tutorial will focus on layout-based methods to extract from VOEs.

What we'll cover

This blog post walks you through extracting data from documents issued by two employment verification providers, Truework and Equifax.

*Sensible app showing queries, sample document, and extracted document field*

We'll examine how the same information requires different extraction approaches based on each provider's document layout. Here are the example documents we’ll use with dummy data:

*Truework employment verification document*

*Equifax employment verification document*

By the end, you'll understand several SenseML methods and you'll be on your way to extracting any data you choose using our documentation or our prebuilt open-source configurations.

Prerequisites

To follow along, you can sign up for a Sensible account, then import example employment verification PDFs and prebuilt open-source configurations directly to the Sensible app using the Out-of-the-box extractions tutorial.

Our configurations for employment verification extractions are comprehensive. To keep the example in this post simple, let's extract solely the following:

- employee name
- employer address
- second-year base pay

We'll also show how fingerprints identify document subtypes.

Pre-extraction provider identification

First, let's walk through identifying different VOE providers, so we use the appropriate queries for each format. We'll use “fingerprints” to do so. (Note that classifying the document generally as a VOE happens upstream and isn't covered in this tutorial.) Fingerprints help Sensible quickly determine the appropriate queries before attempting to extract data from a document.

Truework fingerprint


{
  /* Sensible uses JSON5 to support in-line comments */
  "fingerprint": {
    "tests": [
      {
        /* test every page for consistent Truework branding */
        "page": "every",
        "match": [
          {
            /* look for the standard report title */
            "text": "Verification of Income Report",
            "type": "endsWith",
            "isCaseSensitive": true
          },
          {
            /* verify Truework branding is present */
            "text": "truework",
            "type": "endsWith"
          }
        ]
      }
    ]
  }
}

The Truework fingerprint checks that every page contains the standard report title and Truework branding.

Equifax fingerprint


{
  "fingerprint": {
    "tests": [
      {
        /* test the first page for unique Equifax elements */
        "page": "first",
        "match": [
          {
            /* look for the standard ORDER INFORMATION header */
            "text": "ORDER INFORMATION",
            "type": "equals",
            "isCaseSensitive": true
          },
          {
            /* verify this is a verification document */
            "text": "Verified On:",
            "type": "equals",
            "isCaseSensitive": true
          },
          {
            /* confirm it contains 'verification type' text*/
            "text": "Verification Type:",
            "type": "equals",
            "isCaseSensitive": true
          }
        ]
      },
      {
        /* test the last page for Equifax footer content */
        "page": "last",
        "match": [
          {
            /* match a standalone 4-digit year from 2000 to 2099 (e.g., 2021, 2022) */
            "pattern": "^20\\d{2}$",
            "type": "regex"
          },
          {
            /* verify the standard Equifax verification statement */
            "text": "The statement above is an official verification generated",
            "type": "includes"
          }
        ]
      }
    ]
  }
}

This fingerprint tests the Equifax format by checking that the document contains specific text patterns unique to Equifax reports. If these tests pass, Sensible will use the Equifax-specific extraction queries for this document. The key difference when writing fingerprints for these providers is that Truework maintains consistent branding throughout their shorter documents, while Equifax uses a more complex document structure requiring multi-page validation.
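
To make the fingerprint logic concrete, here is a minimal Python sketch of how such tests could be evaluated against page text. It is not Sensible's implementation: the function names and the sample page text are hypothetical, and the case-insensitive default is an assumption inferred from the `isCaseSensitive` flags in the configurations above.

```python
import re

def line_matches(line: str, match: dict) -> bool:
    """Apply one match object to a single line of text (toy semantics)."""
    if match.get("type") == "regex":
        return re.search(match["pattern"], line) is not None
    text = match["text"]
    # Assumption: matching ignores case unless isCaseSensitive is set.
    if not match.get("isCaseSensitive", False):
        line, text = line.lower(), text.lower()
    kind = match["type"]
    if kind == "equals":
        return line == text
    if kind == "startsWith":
        return line.startswith(text)
    if kind == "endsWith":
        return line.endswith(text)
    if kind == "includes":
        return text in line
    raise ValueError(f"unsupported match type: {kind}")

def page_passes(lines: list, matches: list) -> bool:
    """A page passes when every match object is found on some line of the page."""
    return all(any(line_matches(line, m) for line in lines) for m in matches)

def fingerprint_passes(pages: list, tests: list) -> bool:
    """Run each test against the pages it names: 'first', 'last', or 'every'."""
    for test in tests:
        if test["page"] == "first":
            targets = pages[:1]
        elif test["page"] == "last":
            targets = pages[-1:]
        else:  # "every"
            targets = pages
        if not targets or not all(page_passes(p, test["match"]) for p in targets):
            return False
    return True

# The Truework tests from this post, run against hypothetical page text.
truework_tests = [{
    "page": "every",
    "match": [
        {"text": "Verification of Income Report", "type": "endsWith", "isCaseSensitive": True},
        {"text": "truework", "type": "endsWith"},
    ],
}]
truework_pages = [
    ["Verification of Income Report", "Powered by Truework"],
    ["Verification of Income Report", "Powered by Truework"],
]
other_pages = [["ORDER INFORMATION", "Verified On:"]]

print(fingerprint_passes(truework_pages, truework_tests))  # True
print(fingerprint_passes(other_pages, truework_tests))     # False
```

The "every" test fails as soon as one page lacks either string, which is why it suits short documents with branding on each page, while the "first" and "last" tests only inspect the pages where distinctive text is expected.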

Extract employee name

Let’s compare and contrast different methods for extracting the employee name from different providers’ document layouts. We’ll also look at fallback strategies for handling document variations from a single provider.

Truework employee name extraction

The Truework form clearly labels the employee name:

To extract this data, let’s use the following SenseML query:

{
  "fields": [
    {
      /* user-friendly ID for the extracted data */
      "id": "employee_name",
      "anchor": {
        "match": {
          /* search for target data near the anchor text 'full name' in the doc */
          "text": "full name",
          "type": "startsWith"
        }
      },
      "method": {
        /*   target text is to the right of anchor in a row */
        "id": "row",
        "position": "right"
      }
    }
  ]
}

Truework's clean layout allows for a simple approach:

- The employee name has a "Full Name" label, which serves as the “anchor”.
- The Row method extracts the text that’s horizontally aligned with the anchor.

A note on layout-based extraction

This first field example demonstrates some basic principles of SenseML layout-based methods:

- Each “field” is a basic query unit in Sensible. Each field outputs a piece of data from the document that you want to extract. Sensible uses the field id as the key in the key/value JSON output.
- Sensible searches first for a text "anchor" because it's a computationally quick way to narrow down the location of the target data to extract.
- Then, Sensible uses a "method" to expand its search out from the anchor and extract the data you want.
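
The anchor-then-method flow can be sketched in a few lines of Python. This is an illustration of the idea only, not how Sensible works internally: the `Line` class, the helper functions, the coordinates, and the "Employee Information" line are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    x: float  # left edge, in inches from the left side of the page
    y: float  # top edge, in inches from the top of the page

def find_anchor(lines, text):
    """Return the first line whose text starts with `text`, ignoring case."""
    for line in lines:
        if line.text.lower().startswith(text.lower()):
            return line
    return None

def row_right(lines, anchor, tolerance=0.05):
    """Return the nearest line to the right of the anchor on the same row."""
    candidates = [
        line for line in lines
        if line is not anchor
        and abs(line.y - anchor.y) <= tolerance
        and line.x > anchor.x
    ]
    return min(candidates, key=lambda line: line.x, default=None)

# Hypothetical coordinates for a few lines of the Truework sample.
page = [
    Line("Employee Information", 0.5, 1.0),
    Line("Full Name", 0.5, 1.5),
    Line("Jack Bauer", 4.5, 1.5),
    Line("Employer Address", 0.5, 2.0),
]

anchor = find_anchor(page, "full name")   # step 1: cheap text search
target = row_right(page, anchor)          # step 2: expand out from the anchor
output = {"employee_name": {"type": "string", "value": target.text}}
print(output)
```

The field id becomes the key of the output, and the method only has to look at lines near the anchor rather than at the whole page.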

Equifax employee name extraction (Primary method)

To extract the Equifax employee name, we’ll use a primary field and a fallback field. This accounts for layout variation where the social security number (SSN) can be present but redacted, unredacted, or missing completely. In our example document, the redacted SSN is present, so the primary method works.

We’ll use the following SenseML queries:

{
  "id": "employee_name",
  "anchor": {
    "match": {
      /* anchor can be any of the following matches */
      "type": "any",
      "matches": [
        {
          /* look for redacted SSN format as anchor */
          "text": "xxx-xx",
          "type": "includes"
        },
        {
          /* or look for a 9-10 digit number without dashes (unredacted SSN). Note that we allow matching on erroneously formatted, 10-digit SSNs because we've encountered them in the wild with Equifax forms. */
          "pattern": "\\d{9,10}$",
          "type": "regex"
        }
      ]
    }
  },
  "method": {
    /* extract the name that appears to the left of the SSN */
    "id": "label",
    "position": "left"
  }
}

The primary Equifax method uses the SSN as an anchor because:

- Employee names consistently appear to the left of SSN information
- SSNs appear in a predictable format (either redacted as "xxx-xx-####" or as digits)
- The Label method can extract text positioned relatively closely to the anchor
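
To see which lines the two anchor alternatives accept, the following Python sketch applies the same substring test and regular expression to a handful of made-up lines. The helper name and every sample value are hypothetical, and the case-insensitive substring check is an assumption about how the "includes" match behaves.

```python
import re

UNREDACTED_SSN = re.compile(r"\d{9,10}$")  # same pattern as the query above

def is_ssn_anchor(line: str) -> bool:
    """True when a line satisfies either alternative of the 'any' match."""
    redacted = "xxx-xx" in line.lower()  # "includes" match, assumed case-insensitive
    unredacted = UNREDACTED_SSN.search(line) is not None
    return redacted or unredacted

# Hypothetical lines; none of these values come from a real report.
samples = {
    "XXX-XX-1234": True,               # redacted SSN
    "SSN: 123456789": True,            # 9 digits at the end of the line
    "1234567890": True,                # malformed 10-digit SSN
    "123-45-6789": False,              # dashes break up the digit run
    "Verified On: 01/02/2023": False,  # only 4 trailing digits
}
for line, expected in samples.items():
    assert is_ssn_anchor(line) == expected, line
print("all sample lines behave as expected")
```

Because the pattern is anchored only at the end of the line, any line that ends in nine or more consecutive digits also matches, so it is worth checking your own documents for other long numbers that appear before the SSN.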

Equifax employee name extraction (Fallback method)

When the social security number is missing, we’ll fall back to the following query:

{
  "id": "employee_name",
  "anchor": {
    "match": {
      /* regex pattern for names in all-caps */
      "pattern": "^[A-Z]* [A-Z]* [A-Z]*|^[A-Z]* [A-Z]*",
      "type": "regex"
    },
    /* stop searching before the order information section */
    "end": "order information"
  },
  "method": {
    /* use regex to extract names matching capitalized patterns */
    "id": "regex",
    "pattern": "^[A-Z]* [A-Z]* [A-Z]*|^[A-Z]* [A-Z]*",
    /* filter out all-capped, unwanted lines that might match the pattern */
    "lineFilters": [
      {
        "type": "includes",
        "text": "VERIFICATION SERVICES",
        "isCaseSensitive": true
      },
      {
        "type": "includes",
        "text": "CURRENT AS OF",
        "isCaseSensitive": true
      }
    ]
  }
}

This fallback method activates when the SSN is missing from the document:

- Uses regex patterns to identify all-caps name formats directly
- Searches the document header area before "order information"
- Filters out false matches that might fit the name pattern
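
The interplay of the name pattern, the end boundary, and the line filters can be tried out in plain Python. This is a rough sketch under assumptions, not Sensible's engine: the header lines are hypothetical, the end boundary is treated as a case-insensitive substring test, and filtered lines are simply skipped.

```python
import re

# Same pattern as the Regex method in the fallback query above.
NAME_PATTERN = re.compile(r"^[A-Z]* [A-Z]* [A-Z]*|^[A-Z]* [A-Z]*")
LINE_FILTERS = ["VERIFICATION SERVICES", "CURRENT AS OF"]

def fallback_name(lines):
    """Return the first all-caps match that appears before 'order information'."""
    for line in lines:
        if "order information" in line.lower():
            break  # mirrors the anchor's "end" boundary
        if any(text in line for text in LINE_FILTERS):
            continue  # case-sensitive "includes" filters
        match = NAME_PATTERN.search(line)
        if match:
            return match.group(0)
    return None

# Hypothetical header lines for an Equifax report without an SSN.
header = [
    "EQUIFAX VERIFICATION SERVICES",
    "CURRENT AS OF 01/02/2023",
    "Employment and income report",
    "SHANNON BROWN",
    "ORDER INFORMATION",
]
print(fallback_name(header))
```

Without the two filters, the first header line would be returned instead of the name, since it also consists of capitalized words separated by spaces.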

Extracted values:

Truework:


"employee_name": {
  "type": "string",
  "value": "Jack Bauer"
}

Equifax:


"employee_name": {
  "type": "string",
  "value": "Shannon Brown"
}

Extract employer address

To extract the employer address, we’ll use the Region method for both Truework and Equifax, employing different strategies to find the region.

Truework employer address extraction

Truework uses a consistent label for the employer address:

To extract this address, let’s use the following query:

{
  "id": "employer_address",
  "anchor": {
    "match": {
      /* search for 'employer address' anchor */
      "text": "employer address",
      "type": "startsWith"
    }
  },
  "method": {
    /* define a rectangular region in inches relative to the anchor, and extract all text in the region.
       The region starts 4 inches to the right of the anchor's left edge and 0.2 inches above it,
       and is 3.7 inches wide by 0.5 inches high */
    "id": "region",
    "start": "left",
    "offsetX": 4,
    "offsetY": -0.2,
    "width": 3.7,
    "height": 0.5
  }
}

The Truework approach uses a Region method to capture the address positioned in a specific area relative to the label.
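
The region arithmetic is easy to check by hand. The Python sketch below turns the offsets from the query into page coordinates, assuming that offsets are measured from the midpoint of the anchor's left edge to the region's top-left corner and that y increases down the page. The anchor position, the line coordinates, and the "Employment Status" line are hypothetical, and testing only a line's top-left corner is a simplification.

```python
from dataclasses import dataclass

@dataclass
class Box:
    left: float    # inches from the left side of the page
    top: float     # inches from the top of the page
    width: float
    height: float

def region_from_left_edge(anchor, offset_x, offset_y, width, height):
    """Build the region for "start": "left" (see the assumptions in the lead-in)."""
    start_x = anchor.left
    start_y = anchor.top + anchor.height / 2
    return Box(start_x + offset_x, start_y + offset_y, width, height)

def text_in_region(lines, region):
    """Join the text of every line whose top-left corner falls inside the region."""
    inside = [
        text for text, x, y in lines
        if region.left <= x <= region.left + region.width
        and region.top <= y <= region.top + region.height
    ]
    return " ".join(inside)

# Hypothetical anchor box for the "Employer Address" label.
anchor_box = Box(left=0.5, top=2.0, width=1.2, height=0.2)
region = region_from_left_edge(anchor_box, offset_x=4, offset_y=-0.2, width=3.7, height=0.5)

lines = [
    ("Employer Address", 0.5, 2.0),
    ("111 Drake Street, Livonia, MI 4423 3", 4.6, 2.0),
    ("Employment Status", 0.5, 2.6),
]
print(region)
print(text_in_region(lines, region))
```

With these numbers the region spans roughly 4.5 to 8.2 inches horizontally and 1.9 to 2.4 inches vertically, so it captures the address while excluding the label itself and the row below.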

Equifax employer address extraction

The Equifax employer address spans multiple lines, and its label varies (“headquarters address” or “address 1”).

We’ll use the following query to extract this information:


{
  "id": "employer_address",
  /* format output as a properly structured address */
  "type": "address",
  "anchor": {
    "match": {
      "type": "any",
      "matches": [
        {
          /* look for either address anchor format */
          "text": "headquarters address:",
          "type": "startsWith"
        },
        {
          /* our example document uses this address anchor format */
          "text": "address 1:",
          "type": "startsWith"
        }
      ]
    }
  },
  "method": {
    /* define a rectangular region to capture multi-line address */
    "id": "region",
    /* start from the left edge of the anchor */
    "start": "left",
    /* move 1.35 inches to the right of the anchor */
    "offsetX": 1.35,
    /* move slightly up from the anchor */
    "offsetY": -0.1,
    /* region is 2.5 inches wide */
    "width": 2.5,
    /* region is 0.9 inches tall to capture multiple lines */
    "height": 0.9
  }
}

Extracted values:

Truework:


"employer_address": {
  "type": "string",
  "value": "111 Drake Street, Livonia, MI 4423 3"
}

Equifax:


"employer_address": {
  "value": "2223 Trunis Street Data not provided\nChanhassen MN 55317",
  "type": "address"
}

Extract salary data

To extract base pay data, we’ll use a simple row-based approach for Truework and a complex table intersection approach for Equifax.

Truework base pay extraction (Year 2)

Truework uses a table where Base pay is a dedicated row:

To extract the second year of base pay, use the following query:


{
  "id": "basepay_2",
  /* format output as currency */
  "type": "currency",
  "anchor": {
    /* start search for anchor after employment type section */
    "start": "employment type",
    "match": {
      /* find the base pay row */
      "type": "startsWith",
      "text": "base"
    }
  },
  "method": {
    /* extract values horizontally aligned with the base pay row */
    "id": "row",
    /* look to the right of the anchor */
    "position": "right",
    /* select the second currency value (year 2 data) */
    "tiebreaker": "second"
  }
}
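
As a rough illustration of what the tiebreaker in the query above does, this Python sketch keeps the cells in a row that parse as currency and then picks one by position. It is a toy model, not Sensible's parser: the helper names and the first two cells are hypothetical, and only the selected amount comes from the sample output in this post.

```python
import re

CURRENCY = re.compile(r"^\$?[\d,]+(\.\d+)?$")

def parse_currency(text):
    """Convert text like '$55,520.77' into a float, or None if it isn't currency."""
    cleaned = text.strip()
    if not CURRENCY.match(cleaned):
        return None
    return float(cleaned.lstrip("$").replace(",", ""))

def pick_row_value(row_cells, tiebreaker_index):
    """Keep only the cells that parse as currency, then pick one by position."""
    parsed = [(cell, parse_currency(cell)) for cell in row_cells]
    values = [(source, value) for source, value in parsed if value is not None]
    if tiebreaker_index >= len(values):
        return None
    source, value = values[tiebreaker_index]
    return {"source": source, "value": value, "unit": "$", "type": "currency"}

# Hypothetical cells to the right of the "Base" anchor.
row = ["Annual", "$51,000.00", "$55,520.77"]
print(pick_row_value(row, tiebreaker_index=1))  # index 1 plays the role of "second"
```

The non-currency cell is ignored before the position is counted, which is why the second currency value, rather than the second cell, ends up in the output.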

Truework's layout allows for a straightforward approach:

- The anchor matches the "base" salary row.
- The Row method extracts horizontally aligned values.
- "tiebreaker": "second" selects the second currency value in the row (year 2).

Equifax base pay extraction (Year 2)

Equifax labels base pay using a column header, not a row label:

To extract this data, use the following query:


{
  "id": "basepay_2",
  /* format output as currency */
  "type": "currency",
  "anchor": {
    "start": {
      /* begin search at the income summary section */
      "text": "ANNUAL INCOME SUMMARY",
      "type": "equals",
      "isCaseSensitive": true
    },
    "match": [
      {
        /* find the first year */
        "pattern": "^20\\d{2}$",
        "type": "regex"
      },
      {
        /* find the second year - this is our target row */
        "pattern": "^20\\d{2}$",
        "type": "regex"
      }
    ],
    "end": [
      /* end conditions to scope the search - find two years followed by a footer */
      {
        "pattern": "^20\\d{2}$",
        "type": "regex"
      },
      {
        "pattern": "^20\\d{2}$",
        "type": "regex"
      },
      {
        /* stop before footer or next section */
        "pattern": "^20\\d{2}$|TWN|the statement above",
        "type": "regex",
        "flags": "i"
      }
    ]
  },
  "method": {
    /* find data at the intersection of the year row and the 'base' column */
    "id": "intersection",
    "verticalAnchor": {
      "start": {
        /* scope the vertical search to the income summary section */
        "type": "equals",
        "text": "ANNUAL INCOME SUMMARY"
      },
      "match": {
        /* find the "base" salary column */
        "text": "base",
        "type": "startsWith"
      }
    },
    /* fine-tune horizontal position of intersection point between 2nd-year row and 'base' column */
    "offsetX": 0.2,
    /* clean up spacing issues in extracted currency values */
    "whitespaceFilter": "all"
  }
}

The Equifax approach handles the table’s column headers using the following strategies:

- The anchor finds the second year in the income table using regex pattern matching
- The Intersection method locates where the year row meets the "base" column
- offsetX: 0.2 fine-tunes the horizontal position of the row/column intersection to account for column text header alignment
- whitespaceFilter: "all" cleans up any spacing issues in the extracted currency
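
The geometry behind these strategies can be sketched in Python. This is a simplified model and not Sensible's algorithm: it assumes the intersection point sits at the vertical middle of the year line and the horizontal middle of the "base" header, shifted right by the offset. The `Line` class, all coordinates, the year labels, and the first-year amount are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    left: float
    top: float
    width: float
    height: float

    def contains(self, x, y):
        """True when the point (x, y) falls inside this line's bounding box."""
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)

def intersection(lines, row_anchor, column_anchor, offset_x=0.0, whitespace_filter=None):
    """Return the text located where the anchor's row meets the header's column."""
    x = column_anchor.left + column_anchor.width / 2 + offset_x
    y = row_anchor.top + row_anchor.height / 2
    for line in lines:
        if line is row_anchor or line is column_anchor:
            continue
        if line.contains(x, y):
            text = line.text
            if whitespace_filter == "all":
                text = "".join(text.split())  # drop every whitespace character
            return text
    return None

# Hypothetical table: years run down the left side, "Base" is a column header.
base = Line("Base", 2.0, 5.0, 0.4, 0.2)
year_1 = Line("2021", 0.5, 5.3, 0.4, 0.2)
year_2 = Line("2022", 0.5, 5.6, 0.4, 0.2)
lines = [
    base, year_1, year_2,
    Line("$ 400.00", 2.3, 5.3, 0.7, 0.2),
    Line("$ 443.13", 2.3, 5.6, 0.7, 0.2),
]

print(intersection(lines, year_2, base, offset_x=0.2, whitespace_filter="all"))
print(intersection(lines, year_2, base))  # no offset: the point misses the cell
```

In this made-up layout the amounts start slightly to the right of the header's midpoint, so the point only lands inside the cell once the offset is applied, which is the kind of misalignment the offset is meant to absorb.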

Extracted data:

Truework:


"basepay_2": {
  "source": "$55,520.77",
  "value": 55520.77,
  "unit": "$",
  "type": "currency"
}

Equifax:


"basepay_2": {
  "source": "$443.13",
  "value": 443.13,
  "unit": "$",
  "type": "currency"
}

Summing up layout-based extraction strategies

In this post, you’ve learned how to use a small subset of Sensible’s extraction methods and how to apply them to different document layouts:

- Use the Row or Label methods for cleanly labeled single-line data
- Use the Region method for multi-line data in defined rectangular areas
- Use the Intersection method for complex tables where you need to find data at the meeting point of rows and columns

This general guidance is a bit oversimplified, but it can inform your extraction strategies for different providers:

- Truework employs a cleaner, more modern layout with clear visual separation and consistent labeling, allowing for simpler Row-based extraction.
- Equifax uses a formal, dense structure typical of enterprise reporting systems, requiring precise methods like the Intersection method for tables and multiple fallback strategies for variable data positioning.

Putting it all together

When you run these configurations against employment verification forms, Sensible extracts all the defined fields and returns them in a structured JSON format that's ready to be integrated with your systems.

Sample output for the defined fields:

Complete Truework Output:


{
  "employee_name": {
    "type": "string",
    "value": "Jack Bauer"
  },
  "employer_address": {
    "type": "string",
    "value": "111 Drake Street, Livonia, MI 4423 3"
  },
  "basepay_2": {
    "source": "$55,520.77",
    "value": 55520.77,
    "unit": "$",
    "type": "currency"
  }
}

Complete Equifax Output:


{
  "employee_name": {
    "type": "string",
    "value": "Shannon Brown"
  },
  "employer_address": {
    "value": "2223 Trunis Street Data not provided\nChanhassen MN 55317",
    "type": "address"
  },
  "basepay_2": {
    "source": "$443.13",
    "value": 443.13,
    "unit": "$",
    "type": "currency"
  }
}
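
Once you have output like the samples above, downstream code usually wants plain values rather than typed field objects. Here is a minimal Python sketch of that step. The `flatten_fields` helper is hypothetical, the handling of missing fields as `None` is an assumption, and the envelope that a real API response wraps around these fields isn't shown in this post, so the function only deals with the field dictionary itself.

```python
import json

def flatten_fields(parsed_document):
    """Map each field ID to its plain value, leaving missing fields as None."""
    return {
        field_id: (field or {}).get("value")
        for field_id, field in parsed_document.items()
    }

# The Truework output from this post, pasted in as a JSON string.
truework_output = json.loads("""
{
  "employee_name": {"type": "string", "value": "Jack Bauer"},
  "employer_address": {"type": "string", "value": "111 Drake Street, Livonia, MI 4423 3"},
  "basepay_2": {"source": "$55,520.77", "value": 55520.77, "unit": "$", "type": "currency"}
}
""")

record = flatten_fields(truework_output)
print(record)
```

Because both providers' configurations use the same field IDs, the same flattening code can serve Truework and Equifax documents alike.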

Extract more data

We've covered how to extract a few key pieces of data from employment verification forms. Our prebuilt configurations extract much more information, including multi-year income histories, bonus and overtime details, hire dates, and reference numbers. That full extraction coverage enables use cases such as:

- Automated loan application processing
- Real-time income verification
- Integration with underwriting systems
- Compliance and audit preparation

Start extracting

Congratulations, you've learned some key methods for extracting structured data from employment verification forms! There's more extraction power to uncover. Book a demo or check out our managed services for customized implementation support. Or explore on your own: sign up for an account, check out our prebuilt financial services templates in our open-source library, and peruse our docs to start extracting data from your own documents.
