How to capture the long tail when extracting data from paystubs

Updated on September 2, 2025

The Challenge of Paystub Processing at Scale


In the world of document processing, the Pareto principle often applies: 80% of your document volume comes from 20% of vendors or providers, while the remaining 20% of volume spans dozens of smaller, regional, and niche vendors. This principle holds true whether you’re extracting transaction information from bank statement PDFs, energy use from utility bills, or coverages from insurance declaration pages.

In this tutorial, we'll implement an 80/20 data extraction approach for paystub document processing. For the 20% of high-volume providers like ADP, Paylocity, and Gusto, it makes perfect sense to invest in tailor-made data extraction configurations that handle their consistent document formats with speed and accuracy. But what about the long tail? Companies like BizChecks Payroll (Cape Cod regional) or OnPay (agriculture/nonprofit specialist) each have their own unique formatting that would be impractical to support with individually tailored extraction configurations.

Enter SenseML, Sensible's query language for document automation. With SenseML, you can create dedicated layout-based extraction configs tailored to your high-volume providers, then intelligently handle the long tail with an LLM-based extraction config. Whether you're processing paystub data for lending applications, expense management, or compliance reporting, you can then use Sensible to enforce a consistent data output schema of your choice across all documents. From there, your extracted data is accessible via API, the platform UI, or thousands of other software integrations through Zapier.

What We'll Cover


This tutorial focuses on comparing layout-based and LLM-based document data extraction methods, showing you how each approach handles the same data points differently and when to use each strategy. We’ll focus on a couple of major vendors (Paylocity and ADP) and use a generic paystub example for the long tail.

Sensible app showing layout-based queries, sample ADP document, and extracted document field

Sensible app showing LLM-based queries, sample generic document, and extracted document field


Sensible’s prebuilt support for paystub data extraction is comprehensive. To keep it simple, this blog post will walk you through extracting a couple of key data points from paystubs using different approaches:

  • Employee address 
  • Regular pay for this pay period

We’ll use the following example documents:

High-volume format: Paylocity paystub

High-volume format: ADP paystub

Long-tail format: Generic paystub handled by LLM methods


By the end, you'll understand several SenseML methods, know when to reach for a tailored config for a major vendor versus a long-tail fallback, and be on your way to extracting any data you choose using our documentation or our prebuilt open-source configurations.

Prerequisites

To follow along, you can sign up for a Sensible account, then import paystub PDFs and prebuilt open-source configurations directly to the Sensible app using the Out-of-the-box extractions tutorial.

The 80/20 Extraction Strategy

Before diving into specific extraction techniques, let's understand when to use each approach:

Layout-based methods (80% of volume, 20% of vendors):

  • Suitable for high-volume document providers with consistent formats (predictable field positioning and labels)
  • Offers fast, deterministic extraction
  • Worth the investment in extraction config development

LLM-based methods (20% of volume, 80% of vendors):

  • Suitable for regional and niche providers (varying layouts and field positioning)
  • Offers faster implementation (single extraction config handles multiple providers)

Pre-extraction vendor identification

First, let's walk through identifying different paystub providers, so we:

  • route high-volume vendors’ paystubs to the appropriate extraction configs and
  • let the long tail ‘fall through’ to the LLM-based config.


We'll use “fingerprints” for this classification process. (Note that classifying the document generally as a paystub happens upstream and isn't covered in this tutorial.) Fingerprints help Sensible quickly determine the appropriate layout-based or LLM-based extraction queries to use before attempting to extract data from a document.

Paylocity fingerprint

The Paylocity fingerprint tests for the presence of the "paylocity" text anywhere in the document, which reliably identifies documents from this provider due to their consistent branding:


{
  "fingerprint": {
    "tests": [
      {
         /* check if the text 'paylocity' is present in the document */
        "type": "includes", 
        "text": "paylocity"
      }
    ]
  }
}

ADP fingerprint

The ADP fingerprint uses two distinctive elements: the exact "Earnings Statement" title and the "Period Ending:" label format. Both tests must pass for Sensible to identify this as an ADP document and route it to the appropriate extraction queries:


{
  "fingerprint": {
    "tests": [
      {
        /* does a line in the document exactly match "Earnings Statement"? */
        "type": "equals",
        "text": "Earnings Statement",
        "isCaseSensitive": true
      },
      {
        /* does a line start with "Period Ending:"? both tests must pass */
        "type": "startsWith",
        "text": "Period Ending:",
        "isCaseSensitive": true
      }
    ]
  }
}

When a document doesn't match any fingerprints (like the generic paystub example in this post), it automatically routes to the LLM-based config.
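As a concrete sketch, one way to set up that fallback looks like the following skeleton (illustrative only, assuming the routing behavior described above; the actual LLM-based fields appear in the next section):


{
  /* long-tail config skeleton: with no "fingerprint" key, Sensible always
     considers this config, so paystubs that match neither the Paylocity
     nor the ADP fingerprint are handled here */
  "fields": [
    /* LLM-based queryGroup fields go here; see the next section */
  ]
}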

Extract Employee Address: Spatial vs. Semantic Approaches

Let’s start by extracting the employee address. You’ll learn how to extract this multi-line, variable-length field first with an LLM-based approach, then with layout-based, or spatial, approaches tailored to specific vendors.

General LLM Approach

In the generic example document, the single-line employee address is near the employee name:

Generic paystub employee address

To extract the address from long-tail vendors, the LLM-based config uses a primary LLM prompt and handles LLM errors with a fallback prompt:


{
  /* Sensible uses JSON5 to support in-line comments */
  "fields": [
    {
      "method": {
        "id": "queryGroup",
        "searchBySummarization": "page",
        "queries": [
          {
            "id": "employee_address",
            /* primary prompt: attempt to extract complete address with formatting instructions */
            "description": "employee's address. do not return company's address. if two street addresses are listed, return the second one. return street, city, state and zip, separated with whitespaces",
            "type": "address"
          },
          {
            "id": "_employee_address.street_address",
            /* fallback prompt 1: simpler extraction of just the street address */
            "description": "return the employee street address"
          },
          {
            "id": "_employee_address.city_state_zipcode",
            /* fallback prompt 2: simpler extraction of just city, state, zip */
            "description": "return the employee city, state, and zipcode, separated with whitespaces"
          }
        ]
      }
    },
    {
      "method": {
        "id": "queryGroup",
        /* if the previous `employee_address` field returns null, then chain prompts: use output from fallback address extraction fields as input */
        "source_ids": { 
          /* regular expression to find output of all source field IDs that include "_employee_address" */
          "pattern": ".*employee_address.*" 
        },
        "confidenceSignals": false,
        "queries": [
          {
            /* final prompt: combine and reformat the component address parts */
            "id": "employee_address",
            "type": "address",
            "description": "full employee address. return street, city, state and zip, separated with whitespaces"
          }
        ]
      }
    }
  ]
}

This approach uses natural language processing to locate the employee address regardless of positioning, even providing instructions to handle edge cases like multiple addresses. It also provides error handling by chaining prompts in an agentic fashion: if the LLM can’t find the complete address with the first prompt, it falls back to extracting each part of the address with simpler prompts, then concatenates the parts.

Extracted output:

Sensible returns the following results, along with a qualitative measure of confidence in the accuracy of the LLM’s answer.


  "employee_address": {
    "value": "456 Center St\nSan Mateo CA 94402",
    "type": "address",
    "confidenceSignal": "confident_answer"
  }

Paylocity: Region Method

Now let’s turn from generalized LLM methods to a highly specific, layout-based approach for a big-name vendor, Paylocity.

Paylocity always places employee information in the area bounded by the green box in the following image:

Paylocity employee address with region highlighting

Since the region is consistent, we can extract all the text from that rectangular box using the Region method:


{
  "fields": [
    {
      "id": "employee_address",
      /* format the output as an address, and exclude non-address text */
      "type": "address", 
      /* An anchor is text that helps Sensible locate the target data in the document.
         It acts as a reference point - like a landmark - that's consistently positioned near the information you want to extract */
      "anchor": {
        "match": {
          "text": "employee id",
          "type": "startsWith"
        }
      },
      "method": {
        /* Define a rectangular region in inches relative to the anchor and extract text in that area */
        "id": "region",
        "start": "right",
        "offsetX": 1.2,
        "offsetY": -0.6,
        "width": 4,
        "height": 0.35
      }
    }
  ]
}

This extraction method anchors on "employee id" and extracts text from a rectangular area. It both formats the output and filters out unwanted text using the Address type. The Region method is fast and reliable when you know exactly where the address appears.

Extracted output:


  "employee_address": {
    "value": "11234 Fake Road\nLivonia, Mi 41870",
    "type": "address"
  },

ADP: Region with Filtering

ADP’s layout of the employee address is also suited to extraction using the Region method:

ADP employee address with region highlighting

ADP address extraction with region:


{
  "id": "employee_address",
  "type": "address",
  "anchor": {
    "match": {
      "text": "pay date",
      "type": "startsWith"
    }
  },
  "method": {
    "id": "region",
    "start": "left",
    "offsetX": -0.1,
    "offsetY": 0.1,
    "width": 3,
    "height": 1.2,
      }
}

The extraction region’s inch coordinates and anchor differ from those used in the Paylocity extraction config, but the principle is the same.

Extracted output:


  "employee_address": {
    "value": "223 Ash Drive\nBrenda CA 84880",
    "type": "address"
  },

Schema normalization across LLM- and layout-based extractions

Note that across the generalized LLM-based extraction configuration and the layout-based extractions, you keep the extracted data schema consistent. For example, you use the same id for each field you extract (employee_address) and enforce consistent formatting for the value output using the Address type. There’s more power to be unlocked here too – you can specify any JSON output schema you need using postprocessors, making it easy for your application to handle data from any paystub provider with the same business logic.


{
 /* output from generic LLM extraction config */
 "employee_address": {
    "value": "456 Center St\nSan Mateo CA 94402",
    "type": "address",
    "confidenceSignal": "confident_answer"
  },

  /* output from Paylocity extraction config */
  "employee_address": {
    "value": "11234 Fake Road\nLivonia, Mi 41870",
    "type": "address"
  },

  /* output from ADP extraction config */

  "employee_address": {
    "value": "223 Ash Drive\nBrenda CA 84880",
    "type": "address"
  }
}
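For instance, here’s a minimal postprocessor sketch, assuming the JsonLogic postprocessor shape from Sensible’s docs (verify the exact keys against the postprocessor reference); this rule reshapes the output by returning just the address string, and your actual rule can build whatever schema your application expects:


{
  "postprocessor": {
    "type": "jsonLogic",
    /* JsonLogic's "var" operator reads a path from the extraction output;
       this rule returns only the address string instead of the full field object */
    "rule": {
      "var": "employee_address.value"
    }
  }
}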

Extract regular pay

Let’s walk through extracting the paystub’s regular pay for the current period to show how layout-based methods excel at precise table extraction, while LLM methods provide universal flexibility.

Chain LLM prompts to extract regular pay

The LLM approach is to first extract the pay table, then chain prompts to extract individual cells from the table. We extract individual cells here to normalize the output schema against the layout-based extraction configs, which we’ll explore in the following sections.

Generic paystub: regular pay


{
  "id": "_earnings_table",
  "type": "table",
  "method": {
    /* first, extract the entire earnings table to narrow down the search context for subsequent prompts */
    "id": "list",
    "searchBySummarization": "page",
    "description": "earnings table",
    "properties": [
      {
        "id": "earnings_type",
        "description": "earnings type such as regular, normal, additions"
      },
      {
        "id": "rate",
        "description": "rate"
      },
      {
        "id": "hours",
        "description": "hours"
      },
      {
        "id": "amount",
        "description": "current amount"
      },
      {
        "id": "ytd",
        "description": "year to date amount"
      }
    ]
  }
}

This approach works regardless of table structure, label variations, or positioning.

Extracted results:

Sensible extracts the earnings table:


 "_earnings_table": {
    "columns": [
      {
        "id": "earnings_type",
        "values": [
          {
            "value": "Normal Gross",
            "type": "string"
          },
          {
            "value": "Addi ions",
            "type": "string"
          },
          {
            "value": "Docks",
            "type": "string"
          },
          // etc
        ]
      },
      {
        "id": "rate",
        "values": [
          null,
          null,
          null,
          // etc
        ]
      },
      {
        "id": "hours",
        "values": [
          null,
          null,
          null,
          // etc
        ]
      },
      {
        "id": "amount",
        "values": [
         {
            "value": "6,715.86",
            "type": "string"
          },

          {
            "value": "0.00",
            "type": "string"
          },
          {
            "value": "0.00",
            "type": "string"
          },
          // etc
        ]
      },
      {
        "id": "ytd",
        "values": [
          {
            "value": "53,726.88",
            "type": "string"
          },
          {
            "value": "6,384.20",
            "type": "string"
          },
          {
            "value": "0.00",
            "type": "string"
          },
          // etc
        ]
      }
    ]
  },

To find the regular pay in the extracted table, use the following chained prompt:


{
  "method": {
    "id": "queryGroup",
    /* limit queries to the extracted earnings table instead of searching the whole document */
    "source_ids": ["_earnings_table"],
    "queries": [
      {
        "id": "pay_this_period.regular",
        /* extract individual data points and apply currency type to normalize with other configs' schema output */
        "description": "regular pay for this period",
        "type": "currency"
      }
    ]
  }
}

Extracted output:


 "pay_this_period.regular": {
    "source": "6,715.86",
    "value": 6715.86,
    "unit": "$",
    "type": "currency"
  }

Paylocity and ADP: Intersection method

In contrast to the long tail, Paylocity organizes pay in a predictable table structure, perfect for the layout-based Intersection method:

Paylocity paystub regular pay


{
  "id": "pay_this_period.regular",
  /* format output as currency */
  "type": "currency",
  "anchor": {
    "match": {
      /* find the row starting with 'Regular' */
      "text": "Regular",
      "type": "startsWith"
    }
  },
  "method": {
    /* find the cell at the intersection of the "Regular" row and the "Amount" column */
    "id": "intersection",
    "verticalAnchor": {
      /* start searching for the column heading after the text 'period ending' */
      "start": "period ending",
      "match": {
        /* find the Amount column header */
        "text": "amount",
        "type": "startsWith"
      }
    }
  }
}

This config finds the “Regular” row within the earnings section and extracts the currency value at the intersection of that row and the “Amount” column.


  "pay_this_period.regular": {
    "source": "4,278.47",
    "value": 4278.47,
    "unit": "$",
    "type": "currency"
  },

The ADP extraction config also uses the Intersection method since it has a similar layout:

ADP paystub regular pay

Note that the ADP config takes a slightly different approach in determining which row and column labels to target for the table cell intersection. Since it’s more reliable to match on single-word column labels, the config matches the “hours” column and then offsets by a fraction of an inch (0.7") to land on the “this period” column:


{
  "id": "pay_this_period.regular",
  "type": "currency",
  /* find the 'regular' row */
  "anchor": {
    "start": {
      /* start looking for the row after the text 'earnings' */
      "text": "earnings",
      "type": "startsWith"
    },
    "match": {
      "text": "regular",
      "type": "startsWith"
    }
  },
  "method": {
    "id": "intersection",
    "verticalAnchor": {
      "match": {
        "type": "any",
        "matches": [
          /* to account for form variations, match a column labeled either 'hours' or 'salary/hours' */
          {
            "text": "hours",
            "type": "startsWith"
          },
          {
            "text": "salary/hours",
            "type": "startsWith"
          }
        ]
      }
    },
    /* grab the cell 0.7 inches to the right of the matched column */
    "offsetX": 0.7
  }
}

Extracted output:


"pay_this_period.regular": {
    "source": "3,237.76",
    "value": 3237.76,
    "unit": "$",
    "type": "currency"
  },

Congratulations! You’ve walked through extracting a couple of data points from paystubs and learned strategies for generalized, LLM-based extraction versus tailored, layout-based extraction.

Implementing the 80/20 Strategy

To implement this approach in your document processing system:

  1. Identify your high-volume providers - Analyze your document volume to find which vendors represent 80% of your paystubs
  2. Create layout-based configs - Build precise, fast configs for your top providers using methods like Region, Row, Table, and Label (see the Label sketch after this list)
  3. Develop an LLM fallback - Create a general config using Query Group or List methods that can handle any paystub format
  4. Use fingerprinting - Implement document classification to route each paystub to the appropriate config
  5. Monitor and optimize - Track extraction accuracy and adjust configs as needed
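
As a sketch of the Label method mentioned in step 2, here’s a hypothetical field (the pay_date id and anchor text are illustrative, not taken from the configs above) that grabs the value positioned to the right of its label:


{
  "fields": [
    {
      /* hypothetical field: extract the date that appears to the right of a "pay date" label */
      "id": "pay_date",
      "type": "date",
      "anchor": "pay date",
      "method": {
        /* Label method: extract the value adjacent to the anchor label */
        "id": "label",
        "position": "right"
      }
    }
  ]
}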

This hybrid approach maximizes both efficiency and coverage, ensuring you can process paystubs from any vendor while maintaining optimal performance for your highest-volume sources.

Conclusion

The 80/20 principle provides an optimal strategy for document data extraction at scale. By using layout-based methods for consistent, high-volume providers and LLM-based methods for the diverse long tail, you can build an automated document processing system that's both fast and comprehensive.

Ready to implement document extraction in your application? Book a demo to see how Sensible can help you build a robust document processing pipeline, or check out our managed services for customized implementation support. Or explore on your own: sign up for an account, check out our prebuilt extraction templates in our open-source library, and peruse our docs to start extracting data from your own documents.

Frances Elliott