How a Spanish compliance startup automated extraction from 80-page corporate documents

Updated on

February 24, 2026

Author

Frances Elliott

Corporate compliance requires meticulous documentation. When companies transfer shares, reduce capital, or bring on new shareholders, the legal documents recording these transactions become permanent records that auditors, tax authorities, and compliance teams reference for years.

These documents are dense. A single corporate deed might run 80 pages, mixing company information, shareholder demographics, share allocations, and legal boilerplate across dozens of sections. For a Spanish compliance platform helping clients reconcile their records, extracting structured data from these documents meant hours of manual reading per deed, until the company partnered with Sensible to combine LLMs with deterministic guardrails.

The challenge: Long documents, scattered data, multiple languages

The company builds software that helps Spanish businesses with tax compliance and regulatory reporting, including reconciling shareholder records against official corporate documents before audits surface discrepancies. Getting accurate data out of those source documents (known as "escrituras", or deeds, in Spain) was a challenge:

Sprawling and multilingual. A typical deed runs 50–80 pages, with shareholder demographics scattered far from their share allocations. Documents arrive in Spanish, Catalan, French, or a mix of languages, requiring country names, cities, and regional codes to be normalized regardless of source language.

Fuzzy identity matching. The same shareholder might appear as "María Cristina Beso Arnalot" on page 10 and "Maria C. Beso" on page 50. Middle names, accent marks, and abbreviations vary throughout.

The solution: LLMs constrained by structure

The implementation is approximately 80% LLM-based, but the remaining 20%, the deterministic methods, makes all the difference:

Sections as subdocuments. Corporate deeds follow predictable structures, with clearly demarcated headers. The implementation uses these natural boundaries to create sections: document slices where LLM queries focus on specific content. Rather than asking an LLM to find shareholder demographics somewhere in an 80-page document, the system first deterministically identifies the relevant section, then queries only within it. This makes extraction faster, cheaper, and more reliable.
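As a rough illustration of the idea, the sketch below slices a deed at header lines and hands only one slice to a downstream query. It is not Sensible's configuration: the header names, the routing table, the function names, and the sample deed text are all assumptions made up for this example.

```python
import re

# Hypothetical section headers; the headers in real deeds, and in the
# production configuration, may differ.
HEADER_PATTERN = re.compile(
    r"^(COMPARECEN|INTERVIENEN|EXPONEN|OTORGAN)\b", re.MULTILINE
)

def split_into_sections(text: str) -> dict[str, str]:
    """Slice the document at each matched header, keyed by that header."""
    matches = list(HEADER_PATTERN.finditer(text))
    sections: dict[str, str] = {}
    for i, match in enumerate(matches):
        # A section runs from its header to the next header (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.start():end].strip()
    return sections

def context_for(field_group: str, sections: dict[str, str]) -> str:
    """Return only the section an LLM query should see (hypothetical routing)."""
    routing = {
        "shareholder_demographics": "COMPARECEN",
        "share_allocations": "OTORGAN",
    }
    return sections.get(routing[field_group], "")

# Invented sample text, for illustration only.
deed = (
    "ESCRITURA DE COMPRAVENTA DE PARTICIPACIONES\n"
    "COMPARECEN\nDoña María Cristina Beso Arnalot, mayor de edad...\n"
    "EXPONEN\nQue la sociedad tiene un capital social de...\n"
    "OTORGAN\nPrimero. Se transmiten 1.200 participaciones...\n"
)
sections = split_into_sections(deed)
prompt_context = context_for("shareholder_demographics", sections)
```

Only `prompt_context`, not the full deed, would be sent along with the extraction question, which is where the savings in tokens and the reduction in distracting text would come from.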

Merging scattered data with intelligent joining. Shareholder demographics appear early in a deed; share allocations appear much later. The system runs multiple LLM extraction passes, then merges the results by shareholder name. But names aren't reliable join keys: spelling variations, missing middle names, and inconsistent formatting mean exact matching fails constantly. Sensible tested iteratively until the LLM agents reliably made the judgment calls that fuzzy string matching would get wrong. The agents now merge five separate data lists into unified shareholder records.
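To make the join problem concrete, here is a minimal sketch of the deterministic baseline: accent folding plus a character-similarity score from Python's standard `difflib`. This is the kind of fuzzy string matching the article says falls short, not the LLM-agent approach Sensible uses; the record fields, the threshold, and the second shareholder are assumptions invented for the example.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())

def similarity(a: str, b: str) -> float:
    """Character-level similarity of two normalized names, between 0 and 1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def merge_records(demographics, allocations, threshold=0.5):
    """Attach each allocation to the most similar demographic record.

    Allocations whose best score is below the (arbitrary) threshold are
    returned separately instead of being force-matched.
    """
    merged = [dict(d) for d in demographics]
    unmatched = []
    for alloc in allocations:
        best = max(
            merged,
            key=lambda d: similarity(d["name"], alloc["name"]),
            default=None,
        )
        if best is not None and similarity(best["name"], alloc["name"]) >= threshold:
            best.update({k: v for k, v in alloc.items() if k != "name"})
        else:
            unmatched.append(alloc)
    return merged, unmatched

# Hypothetical records: demographics from early pages, allocations from later ones.
demographics = [
    {"name": "María Cristina Beso Arnalot", "nationality": "ES"},
    {"name": "Jordi Puig Ferrer", "nationality": "ES"},
]
allocations = [{"name": "Maria C. Beso", "shares": 1200}]
merged, unmatched = merge_records(demographics, allocations)
```

A single threshold has to trade false merges against missed merges: an abbreviated name scores only moderately against its full form, while two relatives who share surnames can score just as high. That trade-off is the judgment call the article describes handing to the LLM agents.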

Multilingual normalization. Deterministic lookup tables map language variants to standard codes across Spanish, Catalan, and French, so "ESPAÑA," "ESPANYA," and "SPAIN" all resolve to the same country code, and "ALEMANYA" and "ALEMANIA" both become "DE."
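A lookup table of this kind can be sketched in a few lines. The table below is illustrative and far from complete: the Spain and Germany variants come from the article, the French spellings are added as an assumption, and the use of ISO 3166-1 alpha-2 codes ("ES", "DE") is inferred from the "DE" example rather than confirmed. The function names are hypothetical.

```python
import unicodedata
from typing import Optional

# Keys are stored accent-folded and uppercased so that one entry covers
# "España", "ESPAÑA", and "Espana". French spellings are an assumption.
COUNTRY_CODES = {
    "ESPANA": "ES",
    "ESPANYA": "ES",
    "SPAIN": "ES",
    "ESPAGNE": "ES",
    "ALEMANIA": "DE",
    "ALEMANYA": "DE",
    "ALLEMAGNE": "DE",
}

def fold(value: str) -> str:
    """Trim, uppercase, and strip accents so spelling variants share a key."""
    decomposed = unicodedata.normalize("NFKD", value.strip().upper())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalize_country(raw: str) -> Optional[str]:
    """Return the country code for a name, or None if it is not in the table."""
    return COUNTRY_CODES.get(fold(raw))
```

Returning `None` for unknown names, rather than guessing, keeps the step deterministic and makes gaps in the table visible so they can be filled in.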

The results: From hours to minutes

What previously required reading an 80-page document cover-to-cover now completes in minutes. Approximately 50 data points are extracted per document, including company identifiers, incorporation details, shareholder demographics, share allocations, transaction specifics, and director information.

The volume is modest at roughly six documents per week, but the value per document is enormous. Each extraction replaces hours of manual work and feeds directly into compliance workflows where accuracy matters. The partnership has evolved over two years, with custom features built to handle edge cases and five document configurations in production.

Key takeaways for complex document extraction

Structure enables intelligence. LLMs work better when deterministically constrained to specific sections rather than navigating entire documents.
Low volume doesn't mean low value. A document that takes hours to process manually is worth automating even if you only see a few per week.
Complex use cases require partnership. Some documents are hard enough that off-the-shelf solutions won't work, and a partnership is needed to get to production-ready extraction.
