How to extract data from invoices with Sensible

Updated on March 13, 2026

Invoices look like a solved problem. Every business issues them, every AP team processes them, and every ERP expects them. The reality at scale is different: invoices are one of the highest-variability document types in production, and that variability is what breaks most extraction pipelines.

Invoices are vendor-issued payment requests that document line items, quantities, prices, and payment instructions. At volume, extracting data from them manually introduces keying errors, delays AP cycles, and creates audit gaps. Sensible's hybrid extraction config handles format variability across vendors and returns typed, schema-validated output through a single API endpoint.

Every vendor formats differently. Field positions shift. Line item tables have inconsistent column counts, merged cells, and irregular row spacing. Invoices arrive bundled in the same PDF alongside packing slips, purchase orders, and remittance advice — requiring document detection before extraction even begins. Scan quality ranges from clean digital exports to phone photos of crumpled paper, and partially handwritten invoices are common in certain verticals.

Sensible handles this with a two-tier approach. A generalized LLM-powered template covers the long tail of vendor formats out of the box: no per-vendor configuration required, ready to extract on day one. For vendors whose invoices are high-volume or consistently underperforming on the generalized template, a layout-specific template can be built in 15 to 45 minutes depending on field count and document complexity. Both approaches run through the same API. You get breadth from the generalized template and precision where the volume justifies it.

The examples below use a real commercial invoice from INCAP to show the generalized template approach in action across five field types.


What we'll cover:

  • Vendor and customer identification with LLM field-level disambiguation
  • Invoice header fields with automatic date normalization
  • Line items using the List and Zip methods
  • Payment details with conditional output logic
  • Invoice total with explicit LLM provider selection


Prerequisites

  1. Sign up for a Sensible account
  2. Add invoice extraction support via the Out-of-the-box extractions quickstart
  3. Gather sample invoices from one or more vendors



Write document extraction queries with SenseML

SenseML is Sensible's configuration language for document extraction. Each field in your config defines how to locate and extract a value from the document. A complete config has the top-level shape { "fields": [ { "id": "...", "method": { ... } }, ... ] } — the examples below show individual field objects that slot into that array.

The examples below primarily use two methods:

  • Query Group: pass an LLM prompt to locate fields whose positions shift across vendors, returning typed, schema-validated output
  • List: extract variable-length arrays like line items by describing each property in plain language, without requiring a fixed column layout

The generalized template uses Query Group and List throughout; layout-specific templates layer in deterministic methods for the fields that don't move.




Extract vendor and customer details

The vendor name appears near the logo or in a supplier header area on most invoices. The challenge: a single invoice also contains consignee, "Bill To," and "Ship To" blocks referencing other company names. Without precise instruction, an LLM can pull the wrong one.

The Query Group method accepts an LLM prompt in the  description  field that guides where and how to extract each value. Grouping co-located fields in a single Query Group call gives the LLM more context about their relationships, which improves disambiguation accuracy compared to querying each field independently.


Here are the queries we'll use:


{
  "id": "vendor_customer_info",
  "method": {
    "id": "queryGroup",
    "llmEngine": { "provider": "open-ai" },  // routes these queries to OpenAI
    "confidenceSignals": true,               // adds a confidence rating to each extracted field
    "queries": [
      {
        "id": "Vendor name",
        "description": "Read vendor name only from the header/supplier area near the logo. Do not use billing or shipping blocks.",
        "type": "string"
      },
      {
        "id": "Vendor address",
        "description": "From the header/supplier area: return the vendor address only (no company name). Ignore 'Bill To', 'Ship To', and delivery addresses.",
        "type": "string"
      },
      {
        "id": "Customer name",
        "description": "Extract the consignee or buyer company name from the consignee block. Do not use the vendor or exporter name.",
        "type": "string"
      },
      {
        "id": "Customer address",
        "description": "Extract the consignee or buyer address from the consignee block. Return street, city, postal code, and country.",
        "type": "string"
      }
    ]
  }
}

Setting confidenceSignals: true adds a confidenceSignal property to each output field. A value of "confident_answer" indicates a clear match; any other signal, or a null field, indicates the value warrants human review before it reaches your ERP or AP system.

Extracted value:


{
  "Vendor name": {
    "value": "INCAP CONTRACT MANUFACTURING SERVICES PVT LTD",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "Vendor address": {
    "value": "Pandithanhalli, Hirehalli Post, Tumkur, India",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "Customer name": {
    "value": "MG Energy Systems B.V.",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "Customer address": {
    "value": "Foeke sjoerdswei 3, NL-8914 BH, Leeuwarden, The Netherlands",
    "type": "string",
    "confidenceSignal": "confident_answer"
  }
}
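
As a rough illustration of how the confidence signal could drive a review queue downstream, here is a minimal Python sketch. It assumes the parsed output has the shape shown above; the function name and the "some_other_signal" value are hypothetical and only stand in for any non-confident signal.

```python
from typing import Any, Optional


def fields_needing_review(parsed_document: dict[str, Optional[dict[str, Any]]]) -> list[str]:
    """Return IDs of fields that are null or lack a confident_answer signal."""
    flagged = []
    for field_id, field in parsed_document.items():
        if field is None:
            # A null field means nothing was extracted for that query
            flagged.append(field_id)
        elif field.get("confidenceSignal") != "confident_answer":
            # Any other signal (or a missing signal) is treated as uncertain
            flagged.append(field_id)
    return flagged


sample = {
    "Vendor name": {
        "value": "INCAP CONTRACT MANUFACTURING SERVICES PVT LTD",
        "type": "string",
        "confidenceSignal": "confident_answer",
    },
    "Customer name": {
        "value": "MG Energy Systems B.V.",
        "type": "string",
        "confidenceSignal": "some_other_signal",  # hypothetical non-confident value
    },
    "Invoice due date": None,
}

print(fields_needing_review(sample))  # ['Customer name', 'Invoice due date']
```

Note that this sketch also flags fields that carry no confidenceSignal at all, such as computed fields, so you may want to exclude those by ID.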



Extract invoice header fields

Invoice number and invoice date appear on virtually every invoice, but their labels and positions vary by vendor. Wrapping them in a single Query Group call with  searchBySummarization  enabled submits the full document as context for short invoices like this one — for documents five pages or under, Sensible feeds the entire document to the LLM directly. For longer documents, Sensible summarizes first to identify the most relevant page before extracting, which is especially useful when header fields and payment details are spread across pages.

The "type": "date" declaration normalizes dates to ISO 8601 in the output. One caveat: the date type assumes MM/DD/YYYY by default. For vendors using day-first formats (DD/MM/YYYY is common on invoices from India, Europe, and elsewhere), silent misparsing is possible: an invoice dated 12/04/2023, meaning April 12, could be read as December 4. When the date format is ambiguous for your vendor mix, use "type": "string" instead and include formatting instructions in the description so the LLM handles normalization explicitly.
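
To make the ambiguity concrete, the following Python sketch parses the same date string under both conventions using only the standard library. It illustrates the failure mode and is not a description of Sensible's internal date parsing; the sample string and the is_ambiguous helper are hypothetical.

```python
from datetime import datetime

raw = "12/04/2023"  # hypothetical vendor date, written day-first (12 April 2023)

as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_month_first.isoformat())  # 2023-12-04
print(as_day_first.isoformat())    # 2023-04-12


def is_ambiguous(date_text: str) -> bool:
    """True when the day-first and month-first readings are both valid but differ."""
    readings = set()
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):
        try:
            readings.add(datetime.strptime(date_text, fmt).date())
        except ValueError:
            pass  # this reading is not a valid calendar date
    return len(readings) > 1


print(is_ambiguous("12/04/2023"))  # True
print(is_ambiguous("25/04/2023"))  # False: there is no month 25
```

A check like this on the raw source string is one way to decide which extracted dates deserve a second look.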


Here are the queries we'll use:


{
  "method": {
    "id": "queryGroup",
    "searchBySummarization": true,  // submits full doc for ≤5 pages; summarizes first for longer documents
    "confidenceSignals": true,
    "queries": [
      {
        "id": "Invoice number",
        "description": "Extract the invoice number from this document",
        "type": "string"
      },
      {
        "id": "Invoice date",
        "description": "Extract the invoice date and return it in YYYY-MM-DD format if available",
        "type": "date"              // normalizes to ISO 8601; use "string" if vendor date format is ambiguous
      },
      {
        "id": "Invoice due date",
        "description": "Extract the invoice due date and return it in YYYY-MM-DD format if available",
        "type": "date"
      }
    ]
  }
}

Extracted value:


{
  "Invoice number": {
    "value": "EHTP/2112300047",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "Invoice date": {
    "source": "2023-04-12",
    "value": "2023-04-12T00:00:00.000Z",
    "type": "date",
    "confidenceSignal": "confident_answer"
  },
  "Invoice due date": null
}

The  source  field preserves the raw text from the document;  value  is the normalized ISO 8601 output. When  Invoice due date  is absent, the field returns  null  rather than populating with a guess.




Extract line items

Line item tables are the most structurally variable part of an invoice. Column count, header labels, row spacing, and whether items are grouped by PO all vary by vendor. The List method extracts repeating structured data by describing each property in plain language, without requiring a fixed column layout or header text to anchor against.


Here are the queries we'll use:


{
  "id": "items",
  "type": "table",
  "method": {
    "id": "list",
    "searchBySummarization": true,
    "description": "Extract the information of all items in the invoice",
    "properties": [
      { "id": "item_number", "description": "Item code or Part Number, if included" },
      { "id": "item_description", "description": "Item name or description" },
      {
        "id": "item_unit_quantity",
        "description": "Quantity of the item purchased. If specified, use only shipped quantities.",
        "type": "number"
      },
      { "id": "item_unit_price", "description": "The unit price or per-item price", "type": "currency" },
      { "id": "item_total", "description": "The total amount of the line item", "type": "currency" }
    ]
  }
}

The List method returns parallel arrays (one per property, indexed by row). The Zip method restructures these into an array of row objects, where each object contains all properties for a single line item:


{
  "id": "line_items",
  "method": {
    "id": "zip",
    "source_ids": ["items"]  // restructures parallel arrays from List into row objects
  }
}

Extracted value:


{
  "line_items": [
    {
      "item_number": { "value": "MG3000352", "type": "string" },
      "item_description": { "value": "MG3000352 [Rev. A] - Enclosure LFP304", "type": "string" },
      "item_unit_quantity": { "source": "800", "value": 800, "type": "number" },
      "item_unit_price": { "source": "112.02", "value": 112.02, "unit": "USD", "type": "currency" },
      "item_total": { "source": "89616.00", "value": 89616.00, "unit": "USD", "type": "currency" }
    },
    {
      "item_number": { "value": "MG3000352", "type": "string" },
      "item_description": { "value": "MG3000352 [Rev. A] - Enclosure LFP304", "type": "string" },
      "item_unit_quantity": { "source": "50", "value": 50, "type": "number" },
      "item_unit_price": { "source": "112.02", "value": 112.02, "unit": "USD", "type": "currency" },
      "item_total": { "source": "5601.00", "value": 5601.00, "unit": "USD", "type": "currency" }
    },
    {
      "item_number": { "value": "MG3000255", "type": "string" },
      "item_description": { "value": "LFP24V 230A Enclosure", "type": "string" },
      "item_unit_quantity": { "source": "250", "value": 250, "type": "number" },
      "item_unit_price": { "source": "78.90", "value": 78.90, "unit": "USD", "type": "currency" },
      "item_total": { "source": "19725.00", "value": 19725.00, "unit": "USD", "type": "currency" }
    }
  ]
}

The intermediate items field is suppressed from the final output using the Suppress Output method, keeping the API response clean.
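
If the parallel-array-to-row-object restructuring is unfamiliar, this short Python sketch shows the same idea with plain lists. It is an analogy for what Zip does conceptually, not Sensible's implementation; the values come from the sample output above and zip_columns is a hypothetical helper.

```python
# Parallel arrays, one per property (values taken from the sample invoice)
columns = {
    "item_number": ["MG3000352", "MG3000352", "MG3000255"],
    "item_unit_quantity": [800, 50, 250],
    "item_total": [89616.00, 5601.00, 19725.00],
}


def zip_columns(columns: dict[str, list]) -> list[dict]:
    """Turn {property: [values...]} into [{property: value, ...}, ...]."""
    keys = list(columns)
    # zip(*...) walks the arrays in lockstep, yielding one tuple per row
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


rows = zip_columns(columns)
print(rows[1])
# {'item_number': 'MG3000352', 'item_unit_quantity': 50, 'item_total': 5601.0}
```

Python's zip stops at the shortest array, so a column with a missing value would silently drop rows in this sketch.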


Extract payment details

Vendor invoices frequently include bank transfer instructions: account holder name, bank name, SWIFT/BIC code, and account number. Each gets its own query within a Query Group call.

Here are the queries we'll use:


{
  "method": {
    "id": "queryGroup",
    "searchBySummarization": true,
    "confidenceSignals": true,
    "queries": [
      {
        "id": "Bank_name",
        "description": "Extract the name of the bank provided in the payment details on the invoice",
        "type": "string"
      },
      {
        "id": "Bank Swift code",
        "description": "Extract the bank SWIFT/BIC code provided in the payment details section.",
        "type": "string"
      },
      {
        "id": "Bank account number",
        "description": "Extract the bank account number listed in the payment instructions",
        "type": "string"
      },
      {
        "id": "Bank_account_name_raw",
        "description": "Extract the account holder name from the payment details section",
        "type": "string"
      }
    ]
  }
}

The bank account name requires one additional step: it should only return when a bank name is also present. A Custom Computation field handles this with a JSON Logic conditional:


{
  "id": "Bank account name",
  "method": {
    "id": "customComputation",
    "jsonLogic": {
      "if": [
        { "!==": [{ "var": "Bank_name" }, null] },     // only return when Bank_name was extracted
        { "var": "Bank_account_name_raw.value" },
        null
      ]
    }
  }
}

Extracted value:


{
  "Bank_name": {
    "value": "Axis Bank Ltd",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "Bank account name": { "value": "INCAP IND", "type": "string" },
  "Bank Swift code": {
    "value": "XXXX-XXXX-XXXX",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "Bank account number": {
    "value": "XXXXXXXXXXXXXXX",
    "type": "string",
    "confidenceSignal": "confident_answer"
  }
}

Bank details are redacted from the sample output.

Custom Computation with JSON Logic is Sensible's mechanism for reconciliation logic across fields: conditioning one field's output on another's value, enforcing cross-field rules, or calculating derived values. It runs after LLM extraction on the structured output, so the logic is deterministic even when the upstream extraction is LLM-based.
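
For readers more comfortable with imperative code, the conditional above can be read as the following Python sketch. It mirrors the rule's intent on a plain dictionary and is not how Sensible evaluates JSON Logic; the function name and the sample inputs are hypothetical.

```python
from typing import Any, Optional


def bank_account_name(parsed: dict[str, Optional[dict[str, Any]]]) -> Optional[str]:
    """Return the raw account name only if Bank_name was extracted."""
    if parsed.get("Bank_name") is not None:
        raw = parsed.get("Bank_account_name_raw")
        return raw["value"] if raw else None
    return None


with_bank = {
    "Bank_name": {"value": "Axis Bank Ltd", "type": "string"},
    "Bank_account_name_raw": {"value": "INCAP IND", "type": "string"},
}
without_bank = {
    "Bank_name": None,
    "Bank_account_name_raw": {"value": "INCAP IND", "type": "string"},
}

print(bank_account_name(with_bank))     # INCAP IND
print(bank_account_name(without_bank))  # None
```

The same inputs always produce the same output, which is the point of running this step after the LLM extraction.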


Extract invoice total

Invoice totals require careful extraction. A single document often contains multiple currency figures: subtotals, tax amounts, balance due, and running totals. The description instructs the LLM to return only the final due amount, not a balance or partial figure.

Sensible supports OpenAI, Anthropic, and Google Gemini as LLM providers, giving you the flexibility to route individual queries to whichever model performs best for a given field type. For numeric extraction where the target value is surrounded by similar figures (subtotals, running totals, tax lines), switching providers can improve disambiguation accuracy on your specific document set. Setting llmEngine: { "provider": "anthropic" } routes this query group to Anthropic's models while fields that omit llmEngine use the default provider. If you haven't compared providers on your own documents, omit llmEngine and Sensible uses its default.


Here are the queries we'll use:


{
  "method": {
    "id": "queryGroup",
    "llmEngine": { "provider": "anthropic" },  // routes this query to Anthropic; omit to use the default provider
    "searchBySummarization": true,
    "confidenceSignals": true,
    "queries": [
      {
        "id": "Total amount of invoice",
        "description": "Extract the total due amount from this invoice and return it. Dismiss Balance, only total amount",
        "type": "number"
      }
    ]
  }
}

Extracted value:


{
  "Total amount of invoice": {
    "source": "114,942.00",
    "value": 114942,
    "type": "number",
    "confidenceSignal": "confident_answer"
  }
}

The  source  field preserves the original formatted string from the document; value is the parsed number, typed and ready for financial calculations or reconciliation workflows. Cross-checking against the line item totals (89,616 + 5,601 + 19,725 = 114,942) confirms the extraction. Sensible also has built-in validation capabilities, so this kind of cross-field check can be encoded directly in your template rather than handled downstream.
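
If you do handle the check downstream, a minimal Python sketch might look like the following. It uses the source strings from the sample output and assumes a comma is the thousands separator; totals_match is a hypothetical helper, and a real invoice may also need tax, freight, or discounts factored in.

```python
from decimal import Decimal

line_totals = ["89616.00", "5601.00", "19725.00"]  # item_total sources from the line items
invoice_total_source = "114,942.00"                # source string of the extracted total


def totals_match(line_totals: list[str], total_source: str) -> bool:
    """Compare the sum of line totals with the invoice total using exact decimals."""
    line_sum = sum((Decimal(t) for t in line_totals), Decimal("0"))
    invoice_total = Decimal(total_source.replace(",", ""))  # assumes "," groups thousands
    return line_sum == invoice_total


print(totals_match(line_totals, invoice_total_source))  # True
```

Decimal avoids the rounding drift that floating-point sums can introduce on currency values.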



When to build a layout-specific template

The generalized LLM template above covers the long tail of vendor formats with no per-vendor configuration. For most teams, it handles the majority of invoice volume on day one.

Two signals indicate a vendor warrants a layout-specific template:

  • Volume: a vendor accounts for a meaningful share of your total invoice volume
  • Accuracy: the generalized template consistently underperforms on that vendor's format

When either condition applies, a layout-specific template takes 15 to 45 minutes to configure depending on invoice complexity and the number of fields. Layout templates use deterministic methods for fields with fixed positions on that vendor's format, which improves extraction accuracy and removes the LLM call cost for those fields entirely. Both the generalized template and any layout-specific templates run through the same API endpoint. Sensible selects the right template automatically at extraction time.
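
One way to turn the two signals into a repeatable check is sketched below in Python. The thresholds, vendor names, and counts are all hypothetical placeholders; pick values that reflect your own volume and review workload.

```python
from dataclasses import dataclass


@dataclass
class VendorStats:
    vendor: str
    invoices: int            # invoices processed for this vendor
    flagged_for_review: int  # invoices with at least one null or non-confident field


VOLUME_SHARE_THRESHOLD = 0.10  # hypothetical: vendor is 10% or more of total volume
REVIEW_RATE_THRESHOLD = 0.20   # hypothetical: 20% or more of invoices need review


def layout_template_candidates(stats: list[VendorStats]) -> list[str]:
    """Return vendors that meet either the volume or the accuracy signal."""
    total = sum(s.invoices for s in stats)
    candidates = []
    for s in stats:
        if total == 0 or s.invoices == 0:
            continue  # no data to judge this vendor on
        share = s.invoices / total
        review_rate = s.flagged_for_review / s.invoices
        if share >= VOLUME_SHARE_THRESHOLD or review_rate >= REVIEW_RATE_THRESHOLD:
            candidates.append(s.vendor)
    return candidates


stats = [
    VendorStats("Vendor A", 500, 12),  # high volume
    VendorStats("Vendor B", 40, 15),   # low volume, frequent review
    VendorStats("Vendor C", 30, 1),    # low volume, rarely reviewed
]

print(layout_template_candidates(stats))  # ['Vendor A', 'Vendor B']
```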


Extract more data

Any field present on an invoice can be extracted with Sensible. The five sections above cover core invoice fields. A complete extraction config can also pull PO number, payment terms, currency code (normalized to ISO 4217), tax amount, subtotal, ship date, and other data. Sensible's open-source configuration library includes a prebuilt invoice config to use as a starting point and extend for your specific vendor mix.

To build a custom config from scratch, the SenseML reference covers every available extraction method. If you'd rather have Sensible's team handle configuration, testing, and ongoing maintenance, managed services gets you fully set up.


Connect Sensible to your workflow

Once your SenseML config is set up, there are several ways to integrate invoice extraction into your application or process.

Python SDK

The Sensible Python SDK wraps the extraction API for Python applications. Install with pip and pass a file path or URL to get back a typed  parsed_document object:


pip install sensibleapi


import os

from sensibleapi import SensibleSDK

# Read the API key from an environment variable (the variable name is your choice)
sensible = SensibleSDK(os.environ["SENSIBLE_API_KEY"])

request = sensible.extract(
    path="./invoice.pdf",
    document_type="invoices",
    environment="production"
)

results = sensible.wait_for(request)
parsed = results["parsed_document"]

print(parsed["Vendor name"]["value"])
print(parsed["Total amount of invoice"]["value"])

For async processing at volume, configure a webhook instead of polling with  wait_for. See the Python SDK docs for the full reference.
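
As a rough sketch of the receiving side, the Python below uses only the standard library to accept a webhook POST and pull out two fields. It assumes the webhook body is JSON containing a parsed_document object shaped like the outputs above; confirm the actual payload in the Sensible docs. The names summarize_extraction, ExtractionWebhook, and serve are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_extraction(body: bytes) -> dict:
    """Pull a couple of invoice fields out of a webhook body (assumed shape)."""
    payload = json.loads(body)
    parsed = payload.get("parsed_document") or {}
    vendor = parsed.get("Vendor name") or {}  # fields may be null
    total = parsed.get("Total amount of invoice") or {}
    return {"vendor": vendor.get("value"), "total": total.get("value")}


class ExtractionWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        summary = summarize_extraction(self.rfile.read(length))
        print(summary)           # replace with a write to your AP system or queue
        self.send_response(204)  # acknowledge receipt with no response body
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Start the receiver; call this from your own entry point."""
    HTTPServer(("", port), ExtractionWebhook).serve_forever()
```

Keeping the parsing in a plain function makes it testable without starting a server. In production you would also want to authenticate incoming requests before trusting the payload.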

MCP server

Sensible's MCP server connects document extraction directly to AI coding tools like Claude, letting you query and extract invoice data through natural language without writing API calls. See the MCP server docs for setup instructions.

API (synchronous and asynchronous)

Call the Sensible REST API directly for language-agnostic integration. The synchronous endpoint returns extracted data inline; the asynchronous endpoint accepts a webhook URL and posts results when extraction completes, recommended for high-volume or large-document workflows. See the API reference for endpoint details.

Zapier

For no-code integration, Sensible's Zapier connector routes extracted invoice data into existing workflows without writing code, connecting to Google Sheets, Airtable, Slack, or any of Zapier's connected apps. See the Zapier integration docs to get started.


Frequently asked questions

What fields can be extracted from an invoice?

Core fields include vendor name and address, customer name, invoice number, invoice date, line items (description, quantity, unit price, line total), payment terms, bank details, subtotal, tax amount, and total due. A complete config also pulls PO number, currency code (ISO 4217), ship date, port of loading, and more.

Can Sensible handle invoices from multiple vendors?

The generalized template covers the long tail of vendor formats with no per-vendor configuration. For vendors that are high-volume, or where the generalized template is consistently inaccurate, a layout-specific template takes under an hour to configure. Both run through the same API; Sensible selects the right config automatically.

What format does extracted invoice data come out in?

JSON with typed values: dates as ISO 8601, currency amounts as numbers, line items as arrays of row objects. Every Query Group field includes a confidenceSignal when confidence signals are enabled. The output is schema-validated, so downstream systems receive consistent shapes regardless of how the original invoice was formatted.

How long does it take to set up invoice extraction with Sensible?

The generalized invoice template is available immediately from Sensible's open-source configuration library and can be extended as needed. Layout-specific templates for individual vendors take under an hour to configure, depending on field count and document complexity.

Do I need to train a model to extract data from invoices?

No model training required. SenseML configs define extraction logic using Query Group and List methods for variable-layout fields and deterministic methods for fixed-position fields. Sensible manages model lifecycle, so your configs continue working when foundation models are updated or deprecated.


Start extracting

Download the prebuilt invoice config from Sensible's open-source library and run it against your own vendor samples. The config ships with the Query Group and List method setup shown above, plus additional fields for PO number, currency code, payment terms, and subtotals. Adjust the descriptions or add layout-specific templates for your highest-volume vendors as needed.

Invoices are one document type in a broader AP automation pipeline. Sensible also handles purchase orders, remittance advice, vendor statements, and other relevant doc types from the same vendor set, through the same API.

Start your free 2-week trial at https://app.sensible.so/register/

Want to walk through your specific vendor formats or document volume? Book a meeting at https://www.sensible.so/contact-us

Jason Auh